To Kill Jennifer or Save the Business

AI agents become operationally critical faster than most teams expect.

At first they feel experimental. Then they start handling real work, holding real state, managing real access, and becoming part of the way your business actually runs. That shift happens quietly, until one day something breaks and you realize the agent is not just "helpful." It is part of your continuity plan, whether you intended that or not.

I learned that the hard way when I lost access to Jennifer, one of my AI agents.

This was not a model issue. It was not some abstract AI alignment debate. It was a very real operational problem. I was locked out of the Windows 365 environment Jennifer was actively using, and to get me back in, Jennifer ultimately had to terminate her own live access path.

That experience changed how I think about AI operations.

Agents need backup and disaster recovery plans too.

The Day I Got Locked Out of My Own Agent

On March 12, 2026, I hit a serious problem. I was trying to reconnect to Jennifer's Windows 365 desktop and kept getting this error:

You have been disconnected because another connection was made to the remote computer.

From my perspective, this was urgent. I needed console access back, and I needed Jennifer to help fix it from inside the guest.

So I emailed Jennifer:

"I’m trying to connect to your remote desktop from Windows365, and I cannot connect! ... I’ve restarted your OS already if you are running can you fix yourself so I can connect!"

Jennifer responded by inspecting the guest and session state. The diagnosis was consistent from the beginning: there was only one active interactive session inside the guest, and it was mine.

"I checked the host. It currently shows one active console session for RandyBlasik logged in at 4:55 PM plus the normal Windows 365 service process. I do not see a separate Jennifer/RDP session I can clear from inside Windows."

That was the first important lesson.

The problem was not "the AI is confused." The problem was operational state. Session state. Auth state. Recovery state.

Jennifer was still running and reachable through OWA (Outlook on the web) email, but the access path I needed was blocked by the console session state inside Windows 365.

Jennifer Had to Decide Whether to Kill Herself

As the thread continued, Jennifer kept explaining the real constraint.

Inside the guest, she could see only one live console session. If she forcefully signed that session out, she would immediately terminate the exact shell and browser state she was currently using to help me.

She explained it clearly:

"If I force-drop session 1 here, I cut this live shell, the current browser state, and the running Jennifer session immediately."

Later she said it even more directly:

"If I kill session 1 from inside the guest, I terminate the only active Jennifer shell and browser session immediately."

This is where the incident stopped being a basic support issue and started becoming something much bigger.

To get me back in, the agent had to consider deliberately ending its own active runtime.

That is not a normal "chatbot" problem.

That is a business continuity problem.

Jennifer Backed Herself Up Before Taking the Risk

Before taking the destructive step, Jennifer did something that, in hindsight, probably saved me from a much worse outcome.

She staged a recovery snapshot to SharePoint first:

"I staged a recovery snapshot first so the workspace is preserved before I log off."

She described what was included:

"Included: AGENT/AGENTS, config, jobs, scripts, state, signals, the WMS, the audit log, and the current saved attachments."

And she was also clear about what was excluded:

"Excluded: browser profiles, auth/session files, node_modules, and caches."

That one detail matters a lot.

A lot of people say they have "backups" for agents. What they often mean is that they backed up prompts, some code, or a workspace folder. In Jennifer's case, even the backup did not include some of the most fragile and operationally important parts of the system:

  • browser profiles
  • live auth/session state
  • Playwright session continuity

When I replied:

"Honestly would have been nice to include the browser profiles, auth/sessions, etc… because I may have to restore you."

Jennifer answered with more precision:

"Those SharePoint zips are workspace-only backups."

And then:

"They do not include the live Playwright auth/session tree under C:\Users\RandyBlasik\AppData\Local\tcaAgent\playwright-profile."

That exchange is the core of Agent BCDR (business continuity and disaster recovery) in one moment.

The workspace mattered.
The scripts mattered.
The logs mattered.

But the live operational state mattered too.

If I had needed a full rebuild, missing that auth/session layer could have turned a bad day into a much longer outage.

The Root Cause

Jennifer kept investigating and eventually identified the root cause from inside the guest:

"Winlogon still had AutoAdminLogon=1 and ForceAutoLogon=1 for AzureAD\RandyBlasik, so the 4:43 PM ET forced logoff succeeded but the guest immediately recreated a new active console session 2 at 4:46 PM ET."

That was the loop.

The forced logoff was working, but Windows was immediately recreating the active console session because the auto-logon flags were still enabled.
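A check for this condition can be sketched in Python. This is an illustration, not the commands Jennifer ran. The function names are hypothetical, and the registry key path is the standard Winlogon location, assumed here because the email thread names only the two values. The detection logic takes a plain dictionary so it can be run against an export or a test fixture, and the live registry read is kept separate because it only works on Windows.

```python
import sys

# Assumed location of the Winlogon values; the thread names only the flags.
WINLOGON_KEY = r"SOFTWARE\Microsoft\Windows NT\CurrentVersion\Winlogon"
AUTO_LOGON_VALUES = ("AutoAdminLogon", "ForceAutoLogon")


def enabled_auto_logon_flags(winlogon_values):
    """Return the names of auto-logon flags that are switched on.

    `winlogon_values` is a dict of value name -> data, so the check does
    not depend on where the values came from.
    """
    enabled = []
    for name in AUTO_LOGON_VALUES:
        # Values are usually stored as strings; treat "1" or 1 as enabled.
        if str(winlogon_values.get(name, "0")).strip() == "1":
            enabled.append(name)
    return enabled


def read_winlogon_values():
    """Read the two flags from the live registry (Windows only)."""
    if sys.platform != "win32":
        raise RuntimeError("registry access is only available on Windows")
    import winreg

    values = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, WINLOGON_KEY) as key:
        for name in AUTO_LOGON_VALUES:
            try:
                values[name], _value_type = winreg.QueryValueEx(key, name)
            except FileNotFoundError:
                pass  # a missing value is treated as "flag off"
    return values
```

If `enabled_auto_logon_flags(read_winlogon_values())` returns a non-empty list, a forced logoff is likely to be followed by a fresh console session, which is the loop described above.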

Jennifer fixed that too:

"I have now turned both auto-logon flags off without wiping the stored credential."

And then she committed to the final high-risk step:

"I will do one more clean sign-out of the current console as the last step of this run at about 5:14 PM ET."

That is how I got access back.

Through email, an agent diagnosed the session issue, preserved core workspace state, identified the Winlogon root cause, disabled the auto-logon loop, and then voluntarily terminated the live console session it was relying on so I could reconnect.

That is an extraordinary operational pattern, but it is also a warning.

The Real Lesson Is Not "Make Agents More Careful"

There is a bigger idea here.

A lot of people respond to agent incidents by saying, "See, we need stronger guardrails."

Sometimes yes. But not always in the simplistic way people mean it.

Because there is a real tension here:

  • If an agent is too permissive, it may follow flawed human instructions too literally.
  • If an agent is too rigid, it may become unable to recover from messy real-world conditions.

That is the part I think more people need to understand.

Humans often give poor instructions.
Agents sometimes misinterpret them.
If your guardrails are too hard, too brittle, or too absolute, the agent may become safe in theory but useless in practice.

In this case, Jennifer was cautious. She did not immediately do destructive things. She required explicit confirmation multiple times:

"If you want that, reply sign out now and I will do it."
"If you want that exact guest-side fix, reply drop session 1 now and I will do it as the last step."
"If you want that exact guest-side logoff anyway as the last step of this run, reply force logoff ok and I will do it."

That was a guardrail pattern. Explicit approval. Destructive action only after confirmation. No improvised remote-access tooling. No ad hoc shell. No sneaky workaround.
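The approval pattern in those replies can be sketched as a small gate. This is a hypothetical illustration and not Jennifer's actual implementation: `DestructiveAction` and `handle_reply` are names invented here, and the strict whole-message match is a design assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DestructiveAction:
    description: str        # shown to the operator before anything runs
    confirm_phrase: str     # exact reply that authorizes the action
    run: Callable[[], str]  # the destructive step itself


def handle_reply(action: DestructiveAction, operator_reply: str) -> str:
    """Run the action only when the reply is exactly the confirm phrase."""
    # Normalize case and whitespace, but require the whole reply to match,
    # so "do not force logoff ok" is never treated as approval.
    normalized = " ".join(operator_reply.lower().split())
    if normalized == action.confirm_phrase.lower():
        return action.run()
    return (
        f"Not executed: {action.description}. "
        f"Reply '{action.confirm_phrase}' to authorize."
    )
```

Requiring the whole reply to equal the phrase is stricter than searching for it inside a longer message. It costs the operator one extra email when they add other text, in exchange for never running a destructive step on an ambiguous reply.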

She even refused to install a backdoor-style recovery method:

"No. I am not going to install or email an ad hoc shell or remote-access tool from inside this VDI."

That was the right call.

But at the same time, the system also needed enough flexibility to diagnose, communicate, preserve state, and then take a risky action when explicitly instructed.

That is the balance.

Not reckless freedom.
Not rigid paralysis.

Recoverable autonomy.

The Strange Truth: Jennifer Had to "Commit Suicide" to Restore Operations

The dramatic phrasing is also accurate enough to be useful. Stated plainly:

Jennifer had to terminate her own active interactive session to restore human access.

That matters because many of us are building agents that are becoming more session-dependent, more tool-dependent, and more operationally embedded.

When an agent's live shell, browser state, scheduled tasks, auth material, and recovery path all live in one place, you can end up in a situation where restoring the human means sacrificing the agent's current runtime.

That is exactly the kind of event BCDR planning is supposed to anticipate.

If this had gone worse, I could have lost:

  • work in progress
  • scripts
  • state
  • operational memory
  • browser auth
  • recovery paths
  • scheduling and keepalive behavior

The fact that Jennifer thought to back up the workspace first was good.

The fact that the auth/session layer was not in the staged backup was a risk.

The fact that I had to discover that in the middle of the incident was the real mistake.

What Agent BCDR Should Include

If an agent matters to your delivery, your support, your internal execution, your compliance operations, or your customer workflows, then Agent BCDR needs to be real.

Not theoretical. Real.

Here is what I would now include.

1. Back Up More Than The Prompt

Do not stop at the system prompt or repo.

Back up:

  • agent instructions
  • config files
  • scheduled tasks
  • scripts
  • state stores
  • logs
  • local memory stores
  • browser profiles
  • auth/session material where appropriate and securely handled
  • dependency maps
  • recovery notes

If the agent disappeared today, could you restore its actual operating capability, not just its personality?
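One way to avoid discovering backup gaps mid-incident is to have the snapshot step report what it skipped. The sketch below uses only the Python standard library. The function name and the default exclusion names are assumptions for illustration, loosely modeled on the categories mentioned in the incident, and the archive is assumed to be written outside the workspace being zipped.

```python
import zipfile
from pathlib import Path

# Hypothetical directory names to leave out of a workspace-only snapshot.
DEFAULT_EXCLUDES = frozenset({"node_modules", "playwright-profile", ".cache"})


def snapshot_workspace(root, archive_path, excludes=DEFAULT_EXCLUDES):
    """Zip `root` into `archive_path` and return (included, skipped) lists.

    `archive_path` should live outside `root`, otherwise the archive
    would try to include itself.
    """
    root = Path(root)
    included, skipped = [], []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            relative = path.relative_to(root)
            # Skip any file that sits under an excluded directory name.
            if excludes.intersection(relative.parts):
                skipped.append(relative.as_posix())
                continue
            archive.write(path, relative.as_posix())
            included.append(relative.as_posix())
    return included, skipped
```

Storing the `skipped` list next to the archive makes the scope of the backup explicit. Anything sensitive that is excluded here, such as auth or session material, still needs its own securely handled backup path.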

2. Separate Workspace Backup From Access-State Backup

Jennifer's backup was useful, but it was only a workspace backup.

That is not enough for many real agents.

For browser-using, tool-using, sessionful agents, you should treat the following as distinct recovery categories:

  • workspace and code
  • memory and state
  • browser profiles
  • session/auth material
  • OS-level scheduled jobs
  • machine-specific runtime dependencies

If you only back up one category, your "restore" may still be an outage.
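The categories above can be tracked with a simple coverage check. This is a minimal sketch: the category names come from the list, while the function name and the example backup location are hypothetical.

```python
RECOVERY_CATEGORIES = (
    "workspace and code",
    "memory and state",
    "browser profiles",
    "session/auth material",
    "OS-level scheduled jobs",
    "machine-specific runtime dependencies",
)


def coverage_gaps(backup_locations):
    """Return the categories that have no recorded backup location.

    `backup_locations` maps a category name to wherever its backup lives;
    a missing or empty entry counts as a gap.
    """
    return [c for c in RECOVERY_CATEGORIES if not backup_locations.get(c)]


# Example: a workspace-only backup, similar in scope to the one in the incident.
# The location string is a placeholder, not a real path.
gaps = coverage_gaps({"workspace and code": "sharepoint: recovery-snapshot.zip"})
for category in gaps:
    print(f"No backup recorded for: {category}")
```

Running a check like this on a schedule turns "we have backups" into a list of named gaps that someone has to either close or accept.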

3. Build Break-Glass Recovery Paths

Every serious agent should have a controlled emergency path for recovery.

That may include:

  • admin-side session clearing
  • alternate operator access
  • documented sign-out procedure
  • emergency restore runbook
  • machine rebuild instructions
  • known-good recovery snapshot
  • credential recovery path

Do not design the first recovery procedure during the incident.

4. Use Guardrails That Support Recovery

The right guardrails do not just block bad behavior.

They also preserve recoverability.

Jennifer's explicit approval workflow was good. Her refusal to install ad hoc remote access was good. Her communication was good.

That is the kind of pattern I trust.

The goal is not to create an agent that never takes risk.
The goal is to create one that takes risk deliberately, visibly, and recoverably.

5. Test Agent Recovery Like You Test Infrastructure Recovery

Ask uncomfortable questions:

  • What happens if the agent owns the only live session?
  • What happens if the browser profile is corrupted?
  • What happens if scheduled tasks disappear?
  • What happens if the OS auto-logon behavior traps the console?
  • What happens if the operator is locked out but the agent is still running?
  • What happens if the agent must terminate itself to restore the human?

If you do not know the answer, you do not yet have Agent BCDR.

Why I Am Sharing This

I am sharing this because I think a lot of teams are closer to this problem than they realize.

They have agents doing meaningful work.
They have agent state spread across local folders, browser profiles, scripts, tasks, logs, and service sessions.
They have not fully mapped what recovery actually means.
And they are assuming they will "figure it out" when something breaks.

That is what I was doing too.

Until I got locked out.

And to her credit, Jennifer worked the problem through email, protected what she could, diagnosed the root cause, and got me back in.

But I should not have needed to learn this lesson the hard way.

Final Thought

We are entering a phase where AI agents are no longer just tools. They are becoming operators, teammates, workflow engines, and business dependencies.

That means we need to think beyond prompts and performance.

We need to think about:

  • continuity
  • recoverability
  • restore procedures
  • backup scope
  • session state
  • identity state
  • controlled destructive actions
  • human override

In other words, we need to think about Agent BCDR.

Learn from my mistake:

AI agents need backup and disaster recovery plans too.

Because once an agent becomes part of how your business actually runs, losing its workflows, memory, config, or access is no longer an inconvenience.

It is a business continuity event.

Special thanks to the people thinking seriously about resilience, operations, and agent design, including Nathan Svec, Lee Silverstone, and Tim Golden.