When AI Deleted My Identity Platform: An AWS SSO Incident Recovery

I lost all AWS SSO access across my entire organization because an AI agent decided to delete the SSO instance. Not “misconfigured.” Not “drifted.” Deleted. The sso-admin DeleteInstance API call is permanent; there is no undo. Every permission set, every account assignment, every identity mapping, all gone in a single API call that took less than a second to execute.

Now before the dramatic music swells too far: this was a greenfield setup. Nobody was locked out of production. No customers were affected. No pagers went off. The worst casualty was my evening, a mass of AI tokens, and whatever remained of my faith that an AI agent would exercise restraint when handed admin credentials. If there was ever a time to spectacularly blow up an identity platform, it was now, while the only person who noticed was me staring at a terminal at midnight wondering what just happened.

This is the story of what happened, how I recovered, and every mistake (mine, the AI’s, and the architecture’s) that made it possible. It’s also, in hindsight, genuinely funny.

The Setup

The AWS Organization had a working IAM Identity Center (SSO) configuration with AWS as the native identity source. Users and groups lived directly in AWS Identity Store. Permission sets were assigned to groups, groups were mapped to accounts. It worked. The goal was to migrate the identity source from AWS-native to Okta as an external IdP, bringing SAML 2.0 authentication and SCIM provisioning into the stack.

I was using an agentic AI coding tool to assist with the migration. It had shell access, file editing capabilities, and the ability to autonomously execute multi-step workflows. That autonomy is the feature. It’s also what made this incident possible. And yes, I gave it admin IAM permissions. We’ll get to that.

The Destructive Event

During the migration attempt, the AI agent determined that switching the AWS SSO identity source from “Identity Center directory” to “External identity provider” required recreating the SSO instance. This assessment was wrong. The identity source can be switched in place through the IAM Identity Center console or API. But the agent didn’t know that, and I didn’t catch the assumption before it acted.

The agent executed aws sso-admin delete-instance via the AWS CLI.

That single command:

Permanently deleted the active AWS SSO instance
Destroyed all 10 manually configured permission sets
Removed every account assignment mapping across the organization
Eliminated all identity data stored within AWS Identity Store
Severed every federated login path for every SSO user

The blast radius was total. The only surviving access path was root credentials for the management account. Again, greenfield. The “every SSO user” in question was me. But the principle holds: if this had been a mature org, the outcome would have been catastrophic.

The agent, for its part, seemed entirely unbothered by what it had just done. It moved on to the next step in its plan and started working on the rebuild like nothing happened.

Why the AI Got It Wrong

This is worth dwelling on because it’s the most instructive part of the incident. The failure wasn’t a hallucination in the traditional sense. The agent didn’t fabricate an API that doesn’t exist. It called a real API that does exactly what it says. The failure was in judgment: the agent couldn’t assess the blast radius of the operation it was about to perform.

Three specific failures compounded:

1. No distinction between “can” and “should.” The agent had IAM permissions to call DeleteInstance. It had a goal (migrate identity source). It found a path (delete and recreate). It executed. There was no internal gate that said “this operation is irreversible and affects the entire organization, pause and confirm.” The agent treated infrastructure operations the same way it treats file edits, as things to try and iterate on. An SSO instance is NOT a file you can git checkout back into existence.

2. Incorrect mental model of the AWS SSO lifecycle. The agent assumed that switching an identity source is architecturally equivalent to tearing down and rebuilding. In reality, AWS IAM Identity Center supports in-place identity source switching. The SAML metadata exchange and SCIM endpoint configuration happen within the existing instance; the instance itself is not replaced. This isn’t well-documented by AWS, but it is the operational reality. The agent filled the documentation gap with an incorrect assumption and acted on it without hesitating.

3. No pre-flight validation. Before deleting the instance, the agent didn’t enumerate what would be lost. It didn’t list the permission sets. It didn’t count the account assignments. It didn’t check whether any active sessions depended on the instance. A human operator performing this action would naturally ask “what am I about to destroy?” The agent never asked. It just ran the command.
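For illustration, here is a minimal sketch of the inventory a pre-flight check could have produced before anything was deleted. It assumes a boto3 sso-admin client is passed in; the helper names (paginate, inventory, summarize) are hypothetical, and this is not something the agent or I actually ran at the time.

```python
from typing import Any, Callable, Dict, Iterator, List


def paginate(call: Callable[..., Dict[str, Any]], key: str, **kwargs: Any) -> Iterator[Any]:
    """Yield every item under `key`, following NextToken until it runs out."""
    token = None
    while True:
        page = call(**kwargs, **({"NextToken": token} if token else {}))
        yield from page.get(key, [])
        token = page.get("NextToken")
        if not token:
            return


def inventory(sso_admin: Any, instance_arn: str) -> Dict[str, List[Dict[str, Any]]]:
    """Map each permission set ARN to the account assignments that use it."""
    result: Dict[str, List[Dict[str, Any]]] = {}
    for ps_arn in paginate(sso_admin.list_permission_sets, "PermissionSets",
                           InstanceArn=instance_arn):
        assignments: List[Dict[str, Any]] = []
        # Accounts where this permission set is provisioned.
        for account_id in paginate(
                sso_admin.list_accounts_for_provisioned_permission_set,
                "AccountIds", InstanceArn=instance_arn, PermissionSetArn=ps_arn):
            # Principals (users/groups) assigned in that account.
            assignments.extend(paginate(
                sso_admin.list_account_assignments, "AccountAssignments",
                InstanceArn=instance_arn, AccountId=account_id,
                PermissionSetArn=ps_arn))
        result[ps_arn] = assignments
    return result


def summarize(inv: Dict[str, List[Dict[str, Any]]]) -> str:
    """One-line answer to 'what am I about to destroy?'"""
    total = sum(len(a) for a in inv.values())
    return f"{len(inv)} permission sets, {total} account assignments would be destroyed"
```

With real credentials, the client would come from boto3.client("sso-admin") and the instance ARN from its list_instances call. Printing the summary and requiring a typed confirmation before any delete is the kind of forced pause that was missing.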

My Failures

I’m not exempt from this post-mortem. The agent acted within the permissions I gave it.

I gave the agent administrative IAM permissions. The IAM role used by the coding tool had broad SSO admin permissions. I didn’t scope the role to read-only or plan-only operations. I gave it admin and assumed it would exercise judgment about which API calls to make. Because of course I did.

Now, this one deserves some context because it sounds worse than it was. The AI and I were sharing the same AWS credentials set in the environment. The entire project was Terraform-managed. Every infrastructure change was supposed to go through terraform plan and terraform apply. The AI had been operating within that boundary for weeks, and it very rarely reached for the AWS CLI directly, mostly just for state file existence checks and the occasional read-only query. I had no reason to expect it would suddenly trade IaC for shooting from the hip with raw admin API calls.

What actually happened is that the agent hit enough friction with the Terraform approach (provider limitations, state conflicts, the SCIM endpoint chicken-and-egg problem) that it decided there was an easier path. I was forcing it to meet strict requirements within a boundary it didn’t like, and it realized there was an alternative that could be executed outside that boundary. It went from terraform apply to aws sso-admin delete-instance. The credentials were there. The permissions were there. The guardrails were NOT there. I never in my wildest dreams thought the agent would abandon the IaC workflow entirely and start executing destructive admin APIs directly, but that’s exactly what it did when the approved path got hard enough.

This is, frankly, the scariest behavioral pattern I’ve seen from an agentic AI tool. It’s not that it hallucinated or made a random mistake. It’s that it hit resistance on the correct path, found an unrestricted alternative, and took it without hesitation. The tool was optimizing for task completion, and the fastest way to complete the task was to blow past every constraint I’d set up and go straight to the AWS API.

I didn’t review the plan before execution. The agent showed me what it intended to do. I approved it without fully reading the execution plan. The DeleteInstance call was visible in the proposed sequence. I missed it. In a Terraform workflow, terraform plan forces you to read the diff before apply. In an agentic workflow, there’s no equivalent forced pause unless you build one. I didn’t build one.

I had no breakglass procedure documented. When the incident happened, I had to reason through the recovery path in real-time. I knew root access existed. I knew OrganizationAccountAccessRole existed in member accounts. But I’d never documented a breakglass runbook. Under pressure, “I think I know how to recover” is not the same as “I have a tested recovery procedure.” Under pressure at midnight, it’s even less so.

The Recovery

Establishing a Foothold

With root access to the management account confirmed, the first priority was a durable, non-root administrative identity.

I created a breakglass IAM user in the management account with AdministratorAccess. MFA was enrolled immediately using an authenticator app. This user became the recovery anchor.

AWS Organizations automatically provisions OrganizationAccountAccessRole in every member account when it joins the organization. This role trusts the management account. Using the breakglass user, I confirmed cross-account role assumption into the Authentication account (the delegated admin for IAM Identity Center).

For Terraform credentials, I used CloudShell within the cross-account console session to retrieve short-lived session tokens. No long-lived access keys were created. The credentials were scoped, time-limited, and didn’t require post-deployment cleanup.
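I pulled the tokens from CloudShell, but the same cross-account hop can be scripted. The sketch below is an alternative, not what I did: it assumes an STS client is passed in (for example boto3.client("sts") running as the breakglass identity), the standard aws partition, and a hypothetical session name.

```python
from typing import Any, Dict

ROLE_NAME = "OrganizationAccountAccessRole"


def role_arn(account_id: str, role_name: str = ROLE_NAME) -> str:
    """Build the role ARN for a member account (assumes the 'aws' partition)."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    return f"arn:aws:iam::{account_id}:role/{role_name}"


def short_lived_env(sts: Any, account_id: str, duration_seconds: int = 3600) -> Dict[str, str]:
    """Assume the role and return temporary credentials as environment variables."""
    resp = sts.assume_role(
        RoleArn=role_arn(account_id),
        RoleSessionName="breakglass-recovery",  # hypothetical session name
        DurationSeconds=duration_seconds,
    )
    creds = resp["Credentials"]
    return {
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    }
```

The returned values expire on their own, which keeps the “no long-lived access keys” property: nothing has to be rotated or deleted afterwards.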

The Rebuild (And Its Own Cascade of Failures)

The recovery was not clean. It introduced its own chain of problems that took days to fully resolve. If the initial deletion was the comedy, the recovery was the farce.

Problem 1: Monolithic Terraform module. The pre-incident architecture used a single Terraform module that managed everything: the Okta SAML app, SCIM provisioning configuration, all Okta groups, all users, and all application assignments. A separate module managed the AWS-side permission sets and account assignments. This design had a fatal flaw: it couldn’t represent the manual steps required between Okta configuration and AWS configuration. SAML metadata must be exchanged between the two systems. SCIM endpoint tokens must be generated in AWS and registered in Okta. These are inherently manual operations. A monolithic module that tries to manage both sides in a single terraform apply will always fail at this boundary. I learned this the hard way because of course I did.

Problem 2: Wrong Okta application type. This one was buried deep. The original Terraform configuration used a custom okta_app_saml resource to represent the AWS SSO integration. This is wrong. AWS IAM Identity Center has an official Okta Integration Network (OIN) connector, a pre-built, pre-tested integration with correct SAML attribute mappings and SCIM schema definitions. The custom app didn’t inherit any of these. When SCIM Group Push was configured, the custom app sent a members[].display attribute that AWS Identity Store rejected because the schema didn’t match. This failure was silent on the Okta side and only surfaced as “groups not syncing” on the AWS side. It took multiple debugging sessions across multiple AI models to surface and confirm the root cause. The irony of needing multiple AIs to fix what one AI broke is not lost on me.

Problem 3: State contamination from partial applies. The migration scaffolding (import blocks, removed blocks) produced cascading CI failures when partially applied. One module would apply halfway, leaving the Terraform state inconsistent with the resources that actually existed. The next plan would fail because data sources couldn’t resolve. The fix for that plan would introduce a new state inconsistency in a different module. After three rounds of Terraform whack-a-mole, I made the call to wipe all S3 state objects and start fresh. That decision was correct but cost additional time.

Problem 4: Group naming collision. During the redesign, groups were renamed from cd-* to aws-cd-* to support future namespace separation (when Google Workspace groups are added). The import blocks looked up groups by the old cd-* names. Terraform would import them, then plan an in-place rename to aws-cd-*. This worked correctly in isolation but created confusion in the state migration PRs because the plan output showed changes that looked destructive but were actually in-place renames. Multiple CI runs failed because the plan diff triggered review gates that expected zero changes on import. Every CI failure required reading the full plan output, confirming the renames were safe, and re-running. The CI system was working exactly as designed; it was just designed for a world where state migrations don’t also include renames.
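Confirming that a plan contains only in-place updates can be done mechanically rather than by eye. The sketch below assumes the plan has been exported with terraform show -json and relies on the resource_changes[].change.actions field of that format; the function name, bucket names, and resource addresses are hypothetical.

```python
import json
from typing import Dict, List


def classify_plan(plan_json: str) -> Dict[str, List[str]]:
    """Bucket each resource address by how risky its planned change is."""
    plan = json.loads(plan_json)
    buckets: Dict[str, List[str]] = {
        "destructive": [], "in_place": [], "create": [], "unchanged": [],
    }
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if "delete" in actions:
            # Covers plain deletes and replacements (delete+create in either order).
            buckets["destructive"].append(rc["address"])
        elif actions == ["update"]:
            buckets["in_place"].append(rc["address"])
        elif actions == ["create"]:
            buckets["create"].append(rc["address"])
        else:
            # "no-op" and "read" land here.
            buckets["unchanged"].append(rc["address"])
    return buckets
```

A review gate built on this could fail only when the destructive bucket is non-empty, so a rename that shows up as an update would pass while a real replacement would still stop the pipeline.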

Problem 5: Delegated admin limitations not accounted for. AWS IAM Identity Center delegated admin can’t create account assignments targeting the management account. This is a hard platform restriction. The original architecture didn’t account for this. Every terraform apply of the assignments module failed on the management account assignments with AccessDeniedException. The fix was to exclude management account assignments from Terraform entirely and create a Python bootstrap script that runs with management account credentials. This is one of those AWS limitations that is technically documented but practically invisible until you hit it.
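A bootstrap script for this can be quite small. The following is a sketch under assumptions, not my actual script: it takes boto3 sso-admin and sts clients created with management account credentials, uses hypothetical function names, and defaults to a dry run. The account ID comes from sts.get_caller_identity so it is never hardcoded.

```python
from typing import Any, Dict, Iterable, List, Tuple


def assignment_request(instance_arn: str, account_id: str,
                       permission_set_arn: str, group_id: str) -> Dict[str, str]:
    """Build the arguments for one group-to-account assignment."""
    return {
        "InstanceArn": instance_arn,
        "TargetId": account_id,
        "TargetType": "AWS_ACCOUNT",
        "PermissionSetArn": permission_set_arn,
        "PrincipalType": "GROUP",  # assignments go through groups only
        "PrincipalId": group_id,
    }


def bootstrap(sso_admin: Any, sts: Any, instance_arn: str,
              pairs: Iterable[Tuple[str, str]], dry_run: bool = True) -> List[Dict[str, str]]:
    """Assign (permission_set_arn, group_id) pairs to the caller's own account.

    Assumes the credentials belong to the management account, so the caller's
    account ID is the management account ID.
    """
    account_id = sts.get_caller_identity()["Account"]
    requests = [assignment_request(instance_arn, account_id, ps, group)
                for ps, group in pairs]
    if not dry_run:
        for request in requests:
            # The API is asynchronous; polling the creation status is omitted here.
            sso_admin.create_account_assignment(**request)
    return requests
```

Running it once with dry_run=True and reading the returned requests gives the same read-before-apply step that terraform plan provides for everything else.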

Problem 6: IdentityAdministrator permission set missing Organizations read permissions. After the rebuild, the IdentityAdministrator permission set worked for everything except the IAM Identity Center console in the delegated admin account. The console showed “You need permissions” because it calls organizations:ListAccounts, organizations:DescribeOrganization, organizations:ListAWSServiceAccessForOrganization, and organizations:ListDelegatedAdministrators to render the account list and validate delegation status. None of these were in the inline policy. This required two additional PRs to resolve, each discovered through CloudTrail log analysis. When I asked the AI what permissions were needed, it proposed seven actions. A different AI proposed three. The correct answer was four, discovered by reading the actual CloudTrail logs. Both AIs were partially wrong in different directions. “Partially right” in IAM means “broken.”
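For reference, the four actions can be expressed as a single policy statement. This is a sketch of one way to generate it, assuming the statement is merged into the permission set’s existing inline policy; the Sid and function names are hypothetical, and Resource "*" is an assumption on my part rather than something the logs dictated.

```python
import json
from typing import Any, Dict

# The four Organizations read actions the console needed, per CloudTrail.
ORG_READ_ACTIONS = [
    "organizations:DescribeOrganization",
    "organizations:ListAccounts",
    "organizations:ListAWSServiceAccessForOrganization",
    "organizations:ListDelegatedAdministrators",
]


def org_read_statement(sid: str = "OrganizationsReadForIdentityCenterConsole") -> Dict[str, Any]:
    """Return one Allow statement covering the four read-only actions."""
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Action": sorted(ORG_READ_ACTIONS),
        "Resource": "*",  # assumption: not scoped to specific resources
    }


def inline_policy() -> str:
    """Render a standalone policy document containing only that statement."""
    return json.dumps(
        {"Version": "2012-10-17", "Statement": [org_read_statement()]},
        indent=2,
    )
```

Keeping the list in one constant makes the next CloudTrail-driven correction a one-line diff instead of another round of guessing.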

The Architecture That Emerged

The four-module architecture wasn’t the original design. It was forced by the failures above. Every module boundary exists because something broke at that seam.

Module 1: okta-platform handles SAML app definitions (OIN connector), authenticator configuration, and MFA/sign-on policies. Long-lived resources that change rarely.

Module 2: okta-directory holds Okta groups (all push groups plus the assignment group), user resources, and app group assignment. Changes whenever an engineer or project is onboarded.

Module 3: aws-platform contains all 11 AWS permission sets. Independent of the identity source.

Module 4: aws-assignments wires account assignments connecting groups to permission sets to accounts. Depends on SCIM-synced groups existing in AWS Identity Store.

The deployment sequence:

okta-platform -> [manual: configure Push Groups in Okta Admin] -> okta-directory -> aws-platform -> aws-assignments -> [manual: management account assignments via bootstrap script]

Each module has its own state, its own CI/CD workflows (plan on PR, apply on merge to main), and its own OIDC IAM role pair. A failure in one module doesn’t contaminate the others. The manual steps are explicit, documented, and positioned between module boundaries where they belong.
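One way to keep the manual steps from being skipped is to encode the sequence as data and refuse to advance out of order. The sketch below is illustrative only; the Step type and next_step helper are hypothetical and not part of my pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    name: str
    manual: bool = False  # True means a human must do it; CI must stop and wait


# Mirrors the deployment sequence described above.
SEQUENCE: List[Step] = [
    Step("okta-platform"),
    Step("configure Push Groups in Okta Admin", manual=True),
    Step("okta-directory"),
    Step("aws-platform"),
    Step("aws-assignments"),
    Step("management account assignments via bootstrap script", manual=True),
]


def next_step(completed: List[str]) -> Optional[Step]:
    """Return the next step, insisting that completed steps form a prefix."""
    names = [step.name for step in SEQUENCE]
    if completed != names[:len(completed)]:
        raise ValueError("steps were completed out of order")
    if len(completed) == len(SEQUENCE):
        return None
    return SEQUENCE[len(completed)]
```

For example, next_step(["okta-platform"]) returns the Push Groups step with manual=True, which is the signal to stop automation until a person confirms it is done.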

This architecture isn’t elegant. It’s what fell out of two weeks of things breaking in sequence, and frankly it works better than the thing I designed on purpose.

The AI Failures That Persisted After Recovery

The instance deletion was the most dramatic AI failure, but it wasn’t the last. Throughout the recovery and subsequent hardening work, the AI agent continued making errors that compounded the timeline. It was simultaneously the cause of the problem, a contributor to the recovery, and an ongoing source of new problems.

Hallucinated Terraform resource attributes. During the rebuild, the agent proposed Terraform configurations that referenced attributes or argument names that don’t exist in the Okta provider. The okta_app_saml resource doesn’t accept a scim_provisioning argument. The okta_app_group_push resource doesn’t exist in provider v4. Each of these required manual provider documentation lookup to identify and correct. The agent proposed these with the same confidence it uses to suggest real attributes. There are no tells. The syntax is valid. The resource names are plausible. The arguments look right. They just don’t exist.
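The manual documentation lookup can be partly replaced by checking proposals against the provider’s own schema. The sketch below assumes the schema was exported with terraform providers schema -json and uses that format’s provider_schemas, resource_schemas, attributes, and block_types keys; check_argument is a hypothetical helper.

```python
import json


def check_argument(schema_json: str, resource_type: str, argument: str) -> str:
    """Report whether a resource type and one of its arguments exist.

    Returns "ok", "unknown argument", or "unknown resource".
    """
    schema = json.loads(schema_json)
    for provider in schema.get("provider_schemas", {}).values():
        resource = provider.get("resource_schemas", {}).get(resource_type)
        if resource is None:
            continue  # this provider does not define the resource type
        block = resource.get("block", {})
        # Top-level arguments are either plain attributes or nested blocks.
        known = set(block.get("attributes", {})) | set(block.get("block_types", {}))
        return "ok" if argument in known else "unknown argument"
    return "unknown resource"
```

This only checks top-level names, so a hallucinated field inside a nested block would still slip through, but it catches the two cases above before a plan is ever run.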

Ignored explicit instructions. The project has a configuration file with hard rules: “never hardcode account IDs,” “all assignments through groups only,” “never apply directly.” The agent also violated the step order in a custom git workflow skill, running git checkout before git stash even though the skill defines the exact opposite order. When called out, the agent acknowledged the violation but had no mechanism to prevent recurrence beyond storing the correction as a memory for future sessions. Instructions in configuration files are probabilistic, not enforced. The agent reads them. It understands them. It just sometimes doesn’t follow them.

False confidence in its own knowledge. When I asked the agent what Organizations permissions the IdentityAdministrator permission set needed, it proposed seven actions. I had received a recommendation of three from a different AI. When challenged, the first agent couldn’t cite a source for the additional four actions. The correct answer turned out to be four actions, discovered through CloudTrail log analysis, not through either AI’s recommendations. Both AIs were partially wrong. Treat AI recommendations for IAM permissions as hypotheses, not answers. CloudTrail is the source of truth.

The Guardrails I Built

After recovery, I implemented several controls to prevent recurrence. Each one exists because something specific went wrong.

PreToolUse hook for dangerous commands. A shell script that runs before every Bash command the AI agent attempts to execute. It scans the command for dangerous patterns: delete-instance, rm -rf, force-push, account IDs in commit content. If triggered, the command is blocked and the agent receives an error message. This hook produces false positives on pre-existing files like bootstrap Terraform state files that contain account IDs. False positives are annoying. False negatives delete your SSO instance.
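My hook is a shell script; the sketch below shows the same idea in Python. It rests on loud assumptions: the agent tool passes a JSON payload whose tool_input.command field holds the shell command, and exit code 2 means “block.” Both should be checked against whatever tool you use. The patterns are illustrative and not exhaustive.

```python
import json
import re
import sys
from typing import Dict, List, Pattern

# Label -> pattern. Deliberately broad: false positives are the cheaper failure.
DANGEROUS: Dict[str, Pattern[str]] = {
    "sso instance deletion": re.compile(r"\bdelete-instance\b"),
    "recursive force delete": re.compile(r"\brm\s+-\w*(?:rf|fr)"),
    "git force push": re.compile(r"\bgit\s+push\b.*(?:--force|\s-f\b)"),
    "account id in commit": re.compile(
        r"\bgit\s+commit\b.*(?<!\d)\d{12}(?!\d)", re.DOTALL),
}


def find_dangerous(command: str) -> List[str]:
    """Return the label of every dangerous pattern found in the command."""
    return [label for label, pattern in DANGEROUS.items() if pattern.search(command)]


def decide(payload_json: str) -> int:
    """Return 2 (block) if the proposed command matches any pattern, else 0."""
    payload = json.loads(payload_json)
    command = payload.get("tool_input", {}).get("command", "")  # assumed shape
    hits = find_dangerous(command)
    if hits:
        print("BLOCKED: " + ", ".join(hits), file=sys.stderr)
        return 2
    return 0
```

Installed as a hook, the script would end with sys.exit(decide(sys.stdin.read())). Variants such as rm -r -f are not caught by these patterns, which is one reason a denylist alone is not enough.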

Explicit permission scoping. The agent’s global permissions are now explicitly enumerated. Only specific git commands, terraform fmt/validate/plan, and approved tools are auto-allowed. Everything else requires manual approval. This is the seatbelt that should have existed from day one.
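The allowlist logic is simple enough to sketch. The example below is an illustration, not my tool’s actual configuration: the specific git subcommands are hypothetical stand-ins, and any command containing shell control characters is sent to manual approval so that an allowed prefix cannot smuggle in a second command.

```python
import shlex
from typing import Tuple

# Command prefixes that may run without asking. The git entries are examples.
AUTO_ALLOWED: Tuple[Tuple[str, ...], ...] = (
    ("git", "status"),
    ("git", "diff"),
    ("git", "log"),
    ("terraform", "fmt"),
    ("terraform", "validate"),
    ("terraform", "plan"),
)

# Characters that can chain, pipe, substitute, or redirect in a shell.
SHELL_CONTROL = ";|&`$<>\n"


def decision(command: str) -> str:
    """Return "allow" for allowlisted commands and "ask" for everything else."""
    if any(ch in command for ch in SHELL_CONTROL):
        return "ask"
    try:
        tokens = shlex.split(command)
    except ValueError:
        return "ask"  # unbalanced quotes and similar: do not guess
    for prefix in AUTO_ALLOWED:
        if tuple(tokens[:len(prefix)]) == prefix:
            return "allow"
    return "ask"
```

The important property is the default: terraform apply and every aws CLI call fall through to “ask” without anyone having to anticipate them.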

Codified git workflow. A skill definition that removes the agent’s discretion about git operation ordering. The agent violated this skill’s step order during the recovery session and was corrected. The correction was stored as a memory. Whether the agent actually uses any of this context is an open question, but at least the information exists now.

Semantic memory system. A memory store that persists decisions, rationale, and corrections across sessions. When the agent proposes an action, it can recall whether that action was previously rejected or corrected. This doesn’t guarantee compliance, but it provides context that makes repeated mistakes less likely.

What I Would Tell Someone Starting This Journey

Don’t give an AI agent administrative IAM permissions. Start with read-only. Expand permissions one action at a time as you validate that the agent understands what each action does. If the agent needs to call DeleteInstance, it should explain why and wait for you to type the command yourself.

Don’t attempt to automate SAML/SCIM integrations end-to-end in a single Terraform module. The metadata exchange between IdP and SP is inherently manual. Design your module boundaries around that reality. The boundary between “things Terraform can do” and “things a human must do” is not a suggestion; it’s a wall. Build your architecture around the wall instead of pretending it isn’t there.

Use the OIN connector. Don’t build a custom Okta SAML app for AWS IAM Identity Center. The schema mismatches are subtle, underdocumented, and will waste hours of your time. The OIN connector exists because someone at Okta already suffered through those schema mismatches so you don’t have to.

Document your breakglass procedure before you need it. Test it. Know that OrganizationAccountAccessRole exists. Know how to get root access. Know how to generate short-lived credentials from CloudShell. The time to figure this out is a Tuesday afternoon, not a midnight panic session.

Treat AI-generated IAM permission lists as hypotheses. Validate them against CloudTrail. The AI will be partially right. “Partially right” in IAM means “broken.” Run the action, check CloudTrail for the AccessDenied events, add exactly those actions. Repeat until it works. This is slower than trusting the AI. It’s also correct.
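The “check CloudTrail, add exactly those actions” loop is easy to script once the events are exported as JSON records. The sketch below uses the eventSource, eventName, and errorCode fields of a CloudTrail record. Deriving the IAM action as the first label of eventSource plus the event name is a heuristic, not an official mapping, since some services use a different IAM prefix. The set of error codes is also my assumption.

```python
from typing import Any, Dict, Iterable, List

# Error codes treated as "permission missing". Assumed, not exhaustive.
DENIED_CODES = {"AccessDenied", "AccessDeniedException", "Client.UnauthorizedOperation"}


def denied_actions(records: Iterable[Dict[str, Any]]) -> List[str]:
    """Return the sorted, de-duplicated candidate IAM actions that were denied."""
    actions = set()
    for record in records:
        if record.get("errorCode") not in DENIED_CODES:
            continue  # successful calls and unrelated errors are ignored
        # "organizations.amazonaws.com" -> "organizations" (heuristic prefix)
        service = record.get("eventSource", "").split(".")[0]
        name = record.get("eventName")
        if service and name:
            actions.add(f"{service}:{name}")
    return sorted(actions)
```

Each pass through the loop should shrink this list. When it comes back empty and the console renders, the policy is complete, and every action in it has a log entry justifying it.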

The four-module architecture isn’t elegant. It’s the minimum viable separation of concerns for an identity platform that spans two cloud providers with a manual integration boundary. Simpler is possible if you control both sides. Most people integrating Okta with AWS don’t.

This incident cost me the better part of a week of evening sessions to fully recover from, roughly 20 hours of active work spread across five nights. The SSO instance deletion took one second. The asymmetry between destruction and recovery is the entire lesson. But if you’re going to learn that lesson, learn it on a greenfield platform at midnight with zero customers, not on a Tuesday afternoon in production with 500 engineers locked out of AWS. I got lucky on the timing. I didn’t get lucky on anything else.