VM Idle Management

Use VM Idle Management to reduce compute cost without changing workflow logic.

Path: Settings -> VM Management

Default behavior

Auto-stop is enabled by default for all organizations. New orgs are created with:

  • Auto-Stop Idle Agents: enabled
  • Idle Timeout: 30 minutes
  • Auto-Wake on Job Dispatch: enabled

This ensures cost efficiency is the baseline operating mode. Adjust or disable in Settings as needed.

Configure org-level VM policy

  1. Toggle Auto-Stop Idle Agents on or off.
  2. Set Idle Timeout from 10 to 120 minutes.
  3. Enable Auto-Wake on Job Dispatch if stopped agents should boot automatically when work is queued.
  4. Save VM settings.
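The steps above can be sketched as a small validation helper. The key names in the returned payload are illustrative, not the real API schema; only the three settings and the 10–120 minute bound come from this page.

```python
def validate_vm_policy(auto_stop: bool, idle_timeout_min: int, auto_wake: bool) -> dict:
    """Validate org-level VM settings before saving.

    Payload key names are hypothetical; the documented constraint is that
    Idle Timeout must fall between 10 and 120 minutes.
    """
    if not 10 <= idle_timeout_min <= 120:
        raise ValueError("Idle Timeout must be between 10 and 120 minutes")
    return {
        "autoStopIdleAgents": auto_stop,
        "idleTimeoutMinutes": idle_timeout_min,
        "autoWakeOnDispatch": auto_wake,
    }
```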

What counts as activity

Idle logic uses lastActivityAt (work activity), not just connection heartbeat.

The lastActivityAt timestamp is refreshed by:

  • job dispatch
  • step-status updates
  • worker log ingestion batches via POST /api/job-logs (resolved from jobId)
  • handshake and healthy reconnect events
  • workspace keepalive pings via POST /api/agent/{id}/activity
  • other execution-affecting agent actions

Idle stop also checks for active jobs before stopping an agent. Jobs in queued, processing, or running state block idle-stop.
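A minimal sketch of this eligibility check, assuming the monitor sees the agent's lastActivityAt and the states of its current jobs (function and parameter names are illustrative):

```python
from datetime import datetime, timedelta

# Per the idle-stop rules: jobs in these states block stopping the agent.
ACTIVE_STATES = {"queued", "processing", "running"}

def eligible_for_idle_stop(last_activity_at, job_states, timeout_minutes, now=None):
    """True when the agent is past the idle timeout AND holds no active jobs."""
    now = now or datetime.utcnow()
    if any(state in ACTIVE_STATES for state in job_states):
        return False  # active work always blocks idle-stop
    return now - last_activity_at > timedelta(minutes=timeout_minutes)
```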

Per-agent override (Always-On)

On Agents -> Agent View, use the status chip toggle:

  • Auto-Stop means the agent follows the org idle policy.
  • Always-On means the agent is exempt from auto-stop.

This sets idleAutoStopExempt for that agent.

Dispatch behavior with stopped agents

When dispatching via agent action endpoints, scheduled runs, or retries:

  • If the agent is stopped and auto-wake is enabled, Mimic starts it and waits until it is online (up to 2 minutes).
  • If auto-wake is disabled, dispatch returns a conflict/error response and the run is not started.
  • The same auto-wake policy applies consistently across manual API calls, scheduled runs, and retry attempts.
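The dispatch-side wake logic above can be sketched as follows. `start_fn` and `status_fn` are hypothetical stand-ins for the EC2 start call and the agent status poll; only the 2-minute wait and the conflict-on-disabled behavior come from this page.

```python
import time

WAKE_TIMEOUT_S = 120  # dispatch waits up to 2 minutes for the agent to come online

def wake_and_wait(agent_id, auto_wake_enabled, start_fn, status_fn,
                  poll_s=5, sleep_fn=time.sleep):
    """Start a stopped agent and block until it reports online.

    Returns "online" on success; raises when auto-wake is disabled
    (dispatch is rejected) or the wake times out.
    """
    if status_fn(agent_id) == "online":
        return "online"
    if not auto_wake_enabled:
        raise RuntimeError("conflict: agent stopped and auto-wake disabled")
    start_fn(agent_id)  # corresponds to the wake_requested lifecycle event
    waited = 0
    while waited < WAKE_TIMEOUT_S:
        if status_fn(agent_id) == "online":
            return "online"  # corresponds to wake_ready
        sleep_fn(poll_s)
        waited += poll_s
    raise TimeoutError("wake_timeout: agent did not come online within 2 minutes")
```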

From /app/agents, operators can issue Start and Stop directly from the fleet table. These actions map to POST /api/agent/{id}/action with start/stop and emit manual_start/manual_stop lifecycle events.
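A hedged sketch of building that action request; the JSON body shape (`{"action": ...}`) is an assumption, while the path and the start/stop actions come from this page:

```python
def agent_action_request(base_url, agent_id, action):
    """Build a manual fleet action request for POST /api/agent/{id}/action.

    The fleet table exposes start/stop, which emit manual_start/manual_stop
    lifecycle events. The request body shape here is illustrative.
    """
    if action not in {"start", "stop"}:
        raise ValueError("action must be 'start' or 'stop'")
    return {
        "method": "POST",
        "url": f"{base_url}/api/agent/{agent_id}/action",
        "json": {"action": action},
    }
```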

Resume latency: Starting a stopped agent takes 60-120 seconds. Schedule time-sensitive RCM workflows with this buffer in mind.

Lifecycle events

Every idle stop and auto-wake transition is recorded as a persistent lifecycle event:

Event Type          When
idle_stop_attempt   Idle monitor detects an agent past its timeout
idle_stop_success   Agent successfully stopped
idle_stop_failed    Stop attempt failed (EC2 error)
wake_requested      Auto-wake initiated for a job/schedule
wake_blocked        Auto-wake disabled by org policy
wake_ready          Agent came online after wake
wake_timeout        Agent did not come online within timeout
wake_failed         EC2 start command failed
manual_start        User-initiated start via UI/API
manual_stop         User-initiated stop via UI/API
manual_reboot       User-initiated reboot via UI/API

These events are visible in:

  • Agent Events SSE stream (GET /api/agents/{id}/events) as lifecycle events
  • Agent Logs endpoint (GET /api/agents/{id}/logs) merged into the unified log timeline
  • Agent View UI as a status badge showing the “Waking from idle” state
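For consumers of the SSE stream, a minimal filter for lifecycle events might look like the following. The `data:` line format is standard SSE; the JSON payload shape (`{"type": ...}`) is an assumption for illustration.

```python
import json

# Lifecycle event types from the table above.
LIFECYCLE_EVENTS = {
    "idle_stop_attempt", "idle_stop_success", "idle_stop_failed",
    "wake_requested", "wake_blocked", "wake_ready", "wake_timeout",
    "wake_failed", "manual_start", "manual_stop", "manual_reboot",
}

def lifecycle_events(sse_lines):
    """Yield parsed lifecycle payloads from raw SSE lines, skipping
    comments/keepalives and non-lifecycle events."""
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        event = json.loads(line[len("data: "):])
        if event.get("type") in LIFECYCLE_EVENTS:
            yield event
```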

Agent View performance behavior

Agent View now separates critical status polling from heavy tab data fetches:

  • Live transition polling uses GET /api/agents/{id}/status (lightweight payload).
  • Recent runs load from GET /api/agents/{id}/runs.
  • Portals, functions, and script library data load on demand when tabs are opened.

This reduces DB load during provisioning/reboot transitions while keeping operator feedback near real time.

Verification checklist

  1. Enable auto-stop with a low timeout (e.g., 10 minutes) in dev.
  2. Wait for an idle agent to be stopped — confirm idle_stop_success event appears.
  3. Dispatch a run to the stopped agent — confirm wake_requested and wake_ready events.
  4. Check the agent view UI shows the “Waking from idle” badge during resume.
  5. Verify the run completes successfully after wake.

Watchdog behavior for stale runs

Background monitors enforce runtime limits:

  • jobs with stale heartbeats that have exceeded their max runtime are failed
  • hung worker processes are killed on the VM
  • agent status is reset for subsequent dispatch
  • retry pipeline can requeue according to scheduling policy
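The first check above can be sketched as a pure decision function. The heartbeat staleness threshold here is an assumption, not a documented value; the rule from this page is that a stale heartbeat combined with exceeded max runtime fails the job.

```python
from datetime import datetime, timedelta

def watchdog_verdict(last_heartbeat_at, started_at, max_runtime_min,
                     heartbeat_stale_min=5, now=None):
    """Return "fail" when the run's heartbeat is stale AND it has exceeded
    its max runtime, else "keep". heartbeat_stale_min is illustrative."""
    now = now or datetime.utcnow()
    stale = now - last_heartbeat_at > timedelta(minutes=heartbeat_stale_min)
    over_limit = now - started_at > timedelta(minutes=max_runtime_min)
    return "fail" if (stale and over_limit) else "keep"
```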

Use this with RPA-first scripts to keep long-running automations deterministic and recoverable.

Session lock recovery (Windows)

Idle policy and session recovery solve different failure classes:

  • idle policy controls compute stop/start economics
  • session recovery prevents lock-screen induced mid-run termination

For guarded watchdog rollout and live patch procedures, see Windows Session Recovery.