The Autonomous Environment: Unifying Agents and Invisible Enforcement

I’ve been running Gemini CLI and Claude Code side by side for a while now, and the friction was getting obvious. Both tools are capable on their own, but they had no shared understanding of my standards, so I was constantly re-explaining the same rules across sessions, catching the same categories of mistakes, and manually triggering security checks that should have been invisible. This was the session where I decided to stop babysitting and build something that polices itself.

The Problem with Manual Policing

The core issue wasn’t that the agents were bad at their jobs, it was that there was no single Law they both recognized. I’d catch one of them proposing an IAM wildcard, remind it that I don’t do wildcards, and then find the same pattern in the next session because the correction only lived in the conversation context, not in the environment. The FIPS requirement for my-production-project, the mandatory migration-path note for any cost-saving shortcut in a POC module, the “no hardcoded secrets” rule. All of it existed in markdown files that the agents MAY have read, depending on the session.

When enforcement only happens when I manually ask for it, I’ve already lost. The cognitive overhead of remembering to trigger the audit is roughly equivalent to just doing the audit myself.

Identity Parity Across Tools

The first thing I did was establish a unified identity so both tools felt like the same environment. I created a global GEMINI.md and refactored CLAUDE.md to share the same persona, the same Execution Gate protocol, and the same tiered standards for my-production-project vs. MyProjects. The agent is the Lead Agent or Orchestrator in both files, I’m the Technical Lead, and the interaction rules are identical: every completion requires empirical proof, and three consecutive failures mean stop and analyze, don’t retry blind.

The rules in ~/.claude/rules/ now act as the control plane for Claude, and the equivalent structure in ~/.gemini/ does the same for Gemini. Both sets of rules override general suggestions the model might have, which matters more than it sounds, because the model’s priors aren’t aligned to my specific security posture or cost constraints.

Invisible Enforcement via Shell Hooks

The bigger change was moving from manual “run the security skill” to an autonomous background auditing system wired into file write events. I refactored block-dangerous.sh into a modular set of domain-specific checks: git.sh, terraform.sh, and fs.sh, each responsible for a different blast radius. The fs.sh module in particular was worth building because it closes a loophole I’d already hit: an agent can truncate a file to zero bytes without using a recursive delete, which the old hook didn’t catch.

The more interesting piece was integrating audit.sh into the post-write.sh hook, so every file write triggers a background scan for:

Violations get written to stderr immediately, so the agent sees the issue before it finishes presenting its plan. I don’t have to notice it, ask about it, or remember to check. The environment just tells the agent it’s wrong and the agent corrects course.

Organic Memory via Engram (Theory, Not Proof)

The other half of this was replacing static markdown “brain files” with Engram as the longitudinal memory system. The protocol I designed is straightforward in concept: recall before proposing, store after deciding. What made this more than a fancy notes system was adding mandatory rationale capture. Every major architectural shift requires the agent to store the Why, specifically the trade-offs considered and alternatives rejected, not just the decision itself.

The session-start behavior actually works. A UserPromptSubmit hook fires on the first message, calls the Engram API directly via mTLS, and injects the last 8 project memories and 3 global memories as context before the model sees anything. The agent starts oriented without me having to explain where we left off. That part I’ve verified.

The part I haven’t solved is mid-session recall. The agent doesn’t know what it doesn’t know, so the “recall before proposing” instruction in CLAUDE.md is probabilistic at best. The model’s prior to just answer is strong, and it fires before any self-assessment about whether it should look something up first. I’ve spent more time than I’d like to admit trying to get reliable mid-session Engram usage, and the honest answer is it doesn’t happen consistently.

The Next Hypothesis: Prompt-Driven Injection

The theory I’m currently testing is moving recall out of the model’s hands entirely. The same UserPromptSubmit hook that handles session start can fire on every subsequent message with a different condition. The idea: classify each prompt before the model sees it, skip obvious affirmations and short continuations, extract a keyword query from anything substantive, call Engram recall with that query, and inject the top 3 matching memories as context for that specific turn.

The classification is heuristic-based for now: strip stop words, take the first 10 unique tokens, use that as the search query. It’s not semantic and it won’t be right every time, but even 70% recall accuracy is better than the near-zero rate I get from relying on the model to self-assess. The key architectural point is that this happens at the shell level before the model is invoked, so it doesn’t depend on the model deciding to do it.

I’ve written the hook and wired it into settings.json. What I don’t know yet is whether injecting 3 potentially-relevant memories per turn will actually change how the model responds when it matters, or whether the context gets deprioritized in the attention window the same way CLAUDE.md instructions sometimes do. That’s what I’m about to find out.

Where This Lands

The shell hooks layer is working and verifiable. File writes trigger audits, dangerous commands get blocked before they run, and the session-start injection means I’m not re-orienting the agent at the beginning of every session. That’s real, measurable reduction in overhead.

The Engram mid-session layer is still a hypothesis. The prompt-driven injection is the most mechanically reliable approach I’ve found, but I don’t have data yet on whether it actually changes outcomes. The multi-agentic question of whether hooks work correctly when a sub-agent is the one writing files I haven’t touched yet. One problem at a time.