AI Startup: From Incident Response to SOC 2
The situation
An AI startup discovered that external service API tokens had been compromised and were being abused. I was brought on to lead the investigation.
The underlying infrastructure was capable but inconsistent. The team used Lambdas, ClickHouse, Astronomer, and ECS, with a few partial Terraform efforts. Most IAM roles, VPCs, security groups, and ECS clusters had been configured manually or wired into GitHub Actions through an admin-privilege “CI/CD” service account.
What I did
Incident response. Led the investigation, identified the scope of the compromise, and resolved it.
Cloud security audit. After the incident, I did a thorough audit of their AWS environment. The findings were what you’d expect given the setup: overprivileged roles, inconsistent network boundaries, no organization-level controls.
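Finding overprivileged roles in an audit like this mostly comes down to enumerating what is attached to what. A minimal sketch of one such check, flagging roles that carry a blanket admin policy; the function takes any boto3-style IAM client, and the client wiring shown in the comment is an assumption, not the actual audit tooling:

```python
ADMIN_POLICY = "arn:aws:iam::aws:policy/AdministratorAccess"

def roles_with_policy(iam, policy_arn=ADMIN_POLICY):
    """Return the names of IAM roles with the given managed policy attached.

    `iam` is any object exposing boto3's IAM client interface
    (get_paginator / list_attached_role_policies), so it can be a
    real client or a test stub.
    """
    flagged = []
    for page in iam.get_paginator("list_roles").paginate():
        for role in page["Roles"]:
            attached = iam.list_attached_role_policies(RoleName=role["RoleName"])
            if any(p["PolicyArn"] == policy_arn for p in attached["AttachedPolicies"]):
                flagged.append(role["RoleName"])
    return flagged

# Against a live account (requires credentials):
#   import boto3
#   print(roles_with_policy(boto3.client("iam")))
```

The same pattern extends to inline policies, wildcard actions, and unused roles; the point is that every audit finding should be reproducible by a script, not a console screenshot.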
Remediation through Pulumi. Every fix was done by importing the existing resource into Pulumi first, then making the change in code. No more manual console fixes. I also set up an AWS Organization with Control Tower so there was a proper account structure and guardrails going forward.
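The import-then-change pattern looks roughly like this in Pulumi's Python SDK. This is a configuration sketch, not code from the project: the role name, trust policy, and resource names are hypothetical, and on the first `pulumi up` the declared arguments must match the live resource exactly or the import fails.

```python
import json

import pulumi
import pulumi_aws as aws

# Step 1: adopt the manually created role into Pulumi state.
# `import_` tells Pulumi to take ownership of the existing resource
# instead of creating a new one; for aws.iam.Role the import ID is
# the role name.
ci_role = aws.iam.Role(
    "ci-deploy",
    name="ci-deploy",
    assume_role_policy=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
    opts=pulumi.ResourceOptions(import_="ci-deploy"),
)

# Step 2: once imported, drop `import_` and make the actual fix in code,
# where it goes through review like any other change.
```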
CI/CD security. Replaced the admin-privilege AWS access keys in GitHub Actions with OIDC federation and least-privilege IAM roles. Each workflow now assumes a scoped role with only the permissions it needs. No more over-privileged AWS secret keys sitting in GitHub secrets, and no more personal access tokens to worry about.
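The heart of the OIDC setup is the IAM trust policy that scopes each role to a single repository and branch. A sketch of building one, with hypothetical account, org, and repo names (the real roles and repos aren't part of this write-up):

```python
import json

GITHUB_OIDC = "token.actions.githubusercontent.com"

def github_oidc_trust_policy(account_id: str, org: str, repo: str,
                             branch: str = "main") -> str:
    """Build an IAM trust policy that lets only one GitHub repo/branch
    assume the role via OIDC -- no long-lived access keys involved."""
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                # The audience GitHub's OIDC token presents to AWS.
                "StringEquals": {f"{GITHUB_OIDC}:aud": "sts.amazonaws.com"},
                # Restrict the subject claim to one repo and branch.
                "StringLike": {
                    f"{GITHUB_OIDC}:sub": f"repo:{org}/{repo}:ref:refs/heads/{branch}"
                },
            },
        }],
    }, indent=2)

# Example: a deploy role trusted only by example-org/infra on main.
print(github_oidc_trust_policy("123456789012", "example-org", "infra"))
```

On the workflow side, the job requests an OIDC token (`permissions: id-token: write`) and exchanges it for short-lived credentials via `aws-actions/configure-aws-credentials` with `role-to-assume` pointing at this role.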
Agent tooling. Set up the new IaC repo with full Claude Code agent configs so the team could learn by example and self-serve infrastructure changes going forward.
SOC 2 renewal. Stayed on to support the team through their SOC 2 audit, using the remediated infrastructure and new controls as evidence.
The outcome
The team passed their SOC 2 renewal. Infrastructure changes now go through code review. The admin-privilege CI/CD account is gone. The work that started as incident response became the foundation for how the team manages infrastructure going forward.