
🧠 ClaudusBridge 0.6.x: Cognition Tier & Self-Healing Provider Pipeline

7 min read
Galidar
Founder, Galidar Studio

ClaudusBridge 0.6.0 → 0.6.3 shipped today as a single development arc. The headline: agents now have a real Cognition surface (workflow checkpoints, action-failure observations, continuation-batch savings telemetry, OAuth-state visibility), and the editor's connection to the local Node provider runtime is now self-healing. Restart the runtime out of band and the next chat request just works, with zero LogHttp warnings in the Output Log.

The plugin is up to 567+ MCP tools across 50+ categories in the 0.6.x line.

Already own ClaudusBridge?

Pull the latest from FAB Library; you get 0.6.x as a free update.


0.6.0: The Cognition Tier

Most prior releases focused on what the agent can do. 0.6.0 focuses on what the agent can know about its own work.

Workflow checkpoints (v1: snapshot + audit)​

claudus_checkpoint_workflow(workflow_id, label?, asset_paths?, scan_paths?) writes a snapshot of asset paths + last-modified timestamps under Saved/ClaudusBridge/WorkflowCheckpoints/<checkpoint_id>.json. The asset list resolves in this order:

  1. explicit asset_paths array, then
  2. scan_paths piped through find_assets, then
  3. the workflow's persisted assets[] field, then
  4. empty.
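
The precedence above can be sketched in a few lines of Python. This is an illustration, not the plugin's code: the function name, the `workflow_assets` parameter and the shape of the `find_assets` callable are assumptions made for the example. The source also does not say whether an empty scan result falls through to step 3; this sketch simply returns it.

```python
from typing import Callable, List, Optional, Sequence


def resolve_checkpoint_assets(
    asset_paths: Optional[Sequence[str]],
    scan_paths: Optional[Sequence[str]],
    workflow_assets: Optional[Sequence[str]],
    find_assets: Callable[[str], Sequence[str]],  # hypothetical signature
) -> List[str]:
    """Pick the asset list for a checkpoint using the documented precedence."""
    # 1. An explicit asset_paths array wins outright.
    if asset_paths:
        return list(asset_paths)
    # 2. Otherwise expand every scan path through find_assets.
    if scan_paths:
        found: List[str] = []
        for root in scan_paths:
            found.extend(find_assets(root))
        return found
    # 3. Otherwise fall back to the workflow's persisted assets[] field.
    if workflow_assets:
        return list(workflow_assets)
    # 4. Nothing to snapshot.
    return []
```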

Each checkpoint emits an auto_observation memory entry tagged workflow_checkpoint + <workflow_id> + <checkpoint_id>, so the journal records the audit point automatically. claudus_list_workflow_checkpoints(workflow_id?, limit?) returns the inventory.

Scope, honestly: v1 is the snapshot + audit + diff layer. Rollback is intentionally out of scope until source-control integration lands: UE has no safe generic asset-revert primitive, and we'd rather ship a real audit trail than a fragile "rollback" that breaks in subtle ways. The diff workflow is enough to answer "what did this agent touch?" today.
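
To make "what did this agent touch?" concrete, here is a minimal diff sketch over two snapshots. It assumes each snapshot has already been loaded into a mapping of asset path to last-modified timestamp; the on-disk JSON layout is not described in this post, so the function name and that in-memory shape are hypothetical.

```python
from typing import Dict, List


def diff_checkpoints(
    before: Dict[str, float], after: Dict[str, float]
) -> Dict[str, List[str]]:
    """Compare two path -> mtime snapshots and report what changed."""
    # Assets that only exist in the later snapshot.
    added = sorted(p for p in after if p not in before)
    # Assets that disappeared between the two snapshots.
    removed = sorted(p for p in before if p not in after)
    # Assets present in both whose timestamp differs.
    modified = sorted(p for p in after if p in before and after[p] != before[p])
    return {"added": added, "removed": removed, "modified": modified}
```

Timestamps only show that an asset was saved, not what changed inside it, which matches the audit-only scope described above.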

Action-failure auto-observations​

Every tool call that returns status != "success" now writes a structured memory entry tagged action_failed + tool:<name> + severity:<high|low> (and kind:unknown_command for missing tools). The entry quotes the error message verbatim.

High severity is reserved for mutation-prefixed tools (create_, write_, delete_, save_, modify_, rebuild_, set_, ai_create_, ai_modify_, ai_build_); read-only failures log at low severity so the journal doesn't drown in trivial misses.
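
The tagging rule can be sketched as follows. The prefix list and tag strings come from the text above; the function name and the `tool_exists` flag are assumptions for illustration, not the plugin's API.

```python
from typing import List

# Mutation prefixes listed in the release notes.
MUTATION_PREFIXES = (
    "create_", "write_", "delete_", "save_", "modify_",
    "rebuild_", "set_", "ai_create_", "ai_modify_", "ai_build_",
)


def failure_tags(tool_name: str, tool_exists: bool = True) -> List[str]:
    """Build the tag list for a failed tool call."""
    # str.startswith accepts a tuple and matches any of the prefixes.
    severity = "high" if tool_name.startswith(MUTATION_PREFIXES) else "low"
    tags = ["action_failed", f"tool:{tool_name}", f"severity:{severity}"]
    # Calls to tools that do not exist get an extra kind tag.
    if not tool_exists:
        tags.append("kind:unknown_command")
    return tags
```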

Combined with the existing claudus_recall_observations(tag=action_failed, since_hours=…) filter, a fresh session can read an immediate audit of recent breakage without re-running anything.

Continuation savings telemetry​

claudus_continuation_stats surfaces the existing AutoContinuationFiredCount + BatchedCount with a projected saved_inference_ms (BatchedCount × ~1500 ms by default, since that's the typical provider inference latency we measure) plus a coarse USD estimate. One tool call shows finance the savings from auto-continuation batching.
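
The projection itself is one multiplication. A minimal sketch, assuming the default of 1500 ms per batched call quoted above (the function and constant names are made up for the example, and the USD estimate is left out because its rate is not given here):

```python
DEFAULT_INFERENCE_MS = 1500  # default per-call latency quoted in the release notes


def projected_saved_inference_ms(
    batched_count: int, per_call_ms: int = DEFAULT_INFERENCE_MS
) -> int:
    """Project inference time saved by batching: BatchedCount x per-call latency."""
    if batched_count < 0:
        raise ValueError("batched_count must be non-negative")
    return batched_count * per_call_ms
```

For example, 40 batched continuations project to 40 × 1500 = 60,000 ms, or one minute of avoided inference time.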

OAuth state visibility (observability only)​

The Node provider runtime now writes a sanitized last-auth-state.json to <LOCALAPPDATA>/ClaudusBridge/ProviderRuntime/ at startup, every 5 min, and on shutdown. It records presence + last-modified timestamp for each provider credential file (Claude .credentials.json, ChatGPT ~/.codex/auth.json, Gemini ~/.gemini/oauth_creds.json), and never token values.

claudus_get_last_auth_state reads it back with a coarse verdict: ok / anthropic_missing_but_alt_present / needs_login. Now a session can detect "you'll need to re-login" without touching credentials.
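
One plausible reading of the three verdicts, sketched in Python. The decision rule is inferred from the verdict names rather than taken from the plugin, and the provider keys (`anthropic`, `openai`, `gemini`) and function name are hypothetical.

```python
from typing import Mapping


def auth_verdict(credential_present: Mapping[str, bool]) -> str:
    """Map per-provider credential presence to a coarse verdict (assumed rule)."""
    # Anthropic credentials on disk: nothing to do.
    if credential_present.get("anthropic", False):
        return "ok"
    # No Anthropic credentials, but at least one other provider is logged in.
    if any(ok for name, ok in credential_present.items() if name != "anthropic"):
        return "anthropic_missing_but_alt_present"
    # No credential file found for any provider.
    return "needs_login"
```

Note that the input carries only presence flags, so a check like this never needs to open a credential file.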

Full token persistence stays CCAG-owned; this is the observability layer, with the scope deliberately reduced to what we can ship without coupling to provider-internal credential formats.

What we didn't ship: the Slate live timeline widget​

Originally planned for 0.6.0. Cut from the release because Slate UI work needs visual validation that we can't run headless, and shipping unverified UI code that renders incorrectly is worse than leaving the gap. The structured rows[] from claudus_workflow_timeline already drives external dashboards, and the ASCII Gantt rendering covers Slack/PR/Notion paste-in.


0.6.1 → 0.6.3: The Self-Healing Provider Pipeline

The Claudus provider runtime is a small Node process that ClaudusBridge spawns on the side; it handles auth + routing for Claude/ChatGPT/Gemini. The editor reads the runtime's port from a tiny dispatch.json sentinel at startup, then caches it.

The bug: if the runtime crashed, was killed manually, or upgraded out of band, it relaunched on a new port and rewrote dispatch.json. The editor kept POSTing to the dead port forever (Network error (no response from http://127.0.0.1:<old-port>/v1/messages)) until UE itself was restarted.

Three small releases over the same day closed the loop:

0.6.1: Self-heal on the auto-connect path

When the dashboard auto-connect attempt fails with bConnectedOk=false (libcurl could not connect), the editor now re-reads dispatch.json and, if the port differs, retries against the new URL automatically. New helpers:

  • FCBClaudusAI::TryRediscoverProviderRuntimeURL(): reads the sentinel directly, returns a fresh URL only when the port changed.
  • CBDesktopAuthBridge::InvalidateDispatch(): drops every memoization keyed on the old port so the next DiscoverDispatch() re-reads from disk.
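
The "fresh URL only when the port changed" contract can be illustrated in Python. This is not the C++ helper: the function name is invented, and the sentinel schema (a JSON object with a `port` field) is an assumption, since this post does not describe the contents of dispatch.json.

```python
import json
from pathlib import Path
from typing import Optional
from urllib.parse import urlsplit


def try_rediscover_url(sentinel: Path, cached_url: str) -> Optional[str]:
    """Return a fresh URL if the sentinel names a different port, else None."""
    try:
        # Assumed schema: {"port": <int>}
        port = int(json.loads(sentinel.read_text(encoding="utf-8"))["port"])
    except (OSError, ValueError, KeyError, TypeError):
        # Missing or malformed sentinel: nothing to rediscover.
        return None
    if port == urlsplit(cached_url).port:
        # Same port as the cached URL, so a retry would hit the same endpoint.
        return None
    return f"http://127.0.0.1:{port}/v1/messages"
```

Returning None for an unchanged port lets the caller skip a retry that cannot succeed.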

Bonus: LogJson warning noise on every argless tools/call eliminated. Params->HasField("arguments") + GetObjectField on a JSON-null value was emitting two warnings per call; replaced with TryGetObjectField which returns false silently. Missing and null are both normalised to an empty params object.
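
The same normalisation, shown as a Python analogue rather than the Unreal C++ fix: a missing `arguments` key and an explicit JSON null both collapse to an empty object. The function name is hypothetical.

```python
from typing import Any, Dict


def normalise_arguments(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return the tool-call arguments, treating missing and null alike."""
    # dict.get yields None both for a missing key and for a JSON null value.
    args = params.get("arguments")
    # Anything that is not a JSON object becomes an empty params object.
    return args if isinstance(args, dict) else {}
```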

0.6.2: Self-heal on the normal chat-send paths too

0.6.1 missed the actual chat paths. ask_claudus / submit_chat_sync go through the synchronous ResolveAndAppendReply → CallAnthropic; the Output Log dashboard goes through the async ResolveAndAppendReplyAsync. Neither went through ConnectProviderRuntimeAsync, so the retry never fired.

0.6.2 extracted IssueDashboardProviderRequest() from the async path so the failure callback can reissue against the rediscovered URL while reusing the same "(thinking…)" placeholder, so the user just sees a slightly longer thinking-spinner instead of an error chat entry. On the sync path, when CallAnthropic returns HttpCode == 0 against the local runtime, the editor re-reads dispatch.json, swaps BaseURL, and re-calls CallAnthropic once.
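
The sync-path behaviour (retry exactly once, and only after a no-response result) can be sketched like this. The `call` and `rediscover` callables stand in for CallAnthropic and the dispatch.json re-read; their signatures, like the function name, are assumptions for the example.

```python
from typing import Callable, Optional, Tuple

Reply = Tuple[int, str]  # (http_code, body); http_code 0 stands for "no response"


def call_with_self_heal(
    call: Callable[[str], Reply],
    base_url: str,
    rediscover: Callable[[str], Optional[str]],
) -> Tuple[Reply, str]:
    """Issue a request; on no response, rediscover the URL and retry once."""
    reply = call(base_url)
    if reply[0] != 0:
        # The runtime answered (success or HTTP error): no self-heal needed.
        return reply, base_url
    fresh_url = rediscover(base_url)
    if fresh_url is None:
        # The sentinel still points at the same port: surface the failure.
        return reply, base_url
    # Single retry against the rediscovered URL, which becomes the new base.
    return call(fresh_url), fresh_url
```

Capping the retry at one attempt keeps a genuinely dead runtime from turning into a retry loop.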

End-to-end validation: bootstrap at port A, kill runtime, spawn fresh runtime at port B → ask_claudus("PONG") returned in 1.84 s. The project's editor log captured the transition verbatim:

Provider runtime at http://127.0.0.1:62593/v1/messages did not respond;
rediscovered live port and retrying synchronously against http://127.0.0.1:50662/v1/messages.

0.6.3: Proactive refresh: zero LogHttp noise

0.6.2 delivered the correct user-visible behaviour (the answer comes back), but the first attempt still went to the dead port. libcurl took ~2 s to give up and dumped a multi-line LogHttp: Warning block in the Output Log every time. That looked broken even though the system had already recovered.

0.6.3 flips the policy from retry-after-failure to proactive refresh. A cheap stat() of dispatch.json at the entry of every chat request detects a runtime restart via mtime and swaps BaseURL before the request goes out. libcurl never touches the dead port.

LogTemp: Display: [ClaudusBridge] dispatch.json mtime advanced;
swapping cached provider runtime URL http://127.0.0.1:50662/v1/messages
-> http://127.0.0.1:52523/v1/messages before issuing request.

Cost: single-digit microseconds per request on local SSD. Benefit: no 2 s libcurl timeout, no LogHttp: Warning: libcurl error: 7 lines, no visible recovery message. Just the right port from the start.

The retry-after-failure logic from 0.6.1/0.6.2 stays as a safety net for the unlikely race where the swap happens between the stat() and the actual request.
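
A minimal Python sketch of the stat-before-send idea follows. It is an illustration, not the editor's C++ code: the class name is invented, and the sentinel schema (a JSON object with a `port` field) is an assumption.

```python
import json
import os
from typing import Optional


class DispatchCache:
    """Caches the provider URL and refreshes it when dispatch.json changes."""

    def __init__(self, sentinel_path: str, base_url: str) -> None:
        self.sentinel_path = sentinel_path
        self.base_url = base_url
        self._mtime_ns: Optional[int] = None  # mtime observed at the last read

    def refresh_if_changed(self) -> str:
        """Call at the start of every request; returns the URL to use."""
        try:
            mtime_ns = os.stat(self.sentinel_path).st_mtime_ns
        except OSError:
            return self.base_url  # sentinel missing: keep the cached URL
        if mtime_ns != self._mtime_ns:
            try:
                with open(self.sentinel_path, encoding="utf-8") as fh:
                    port = int(json.load(fh)["port"])  # assumed schema
            except (OSError, ValueError, KeyError, TypeError):
                return self.base_url  # unreadable sentinel: keep the cached URL
            self._mtime_ns = mtime_ns
            self.base_url = f"http://127.0.0.1:{port}/v1/messages"
        return self.base_url
```

The common case costs one `stat` call and an integer comparison; the file is only parsed when the timestamp moves.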


What this means for day-to-day work​

| If you're… | What changed |
| --- | --- |
| Running multi-agent workflows | Checkpoint asset state at milestones. Recall the journal of failures and observations between sessions. |
| Demonstrating ROI | claudus_continuation_stats and the cost telemetry from earlier releases give finance a single tool call to show savings. |
| Restarting the provider runtime manually (or it crashed) | Just keep working. The editor recovers transparently. No more "kill UE and relaunch the project". |
| Reading the Output Log | The LogJson warning noise on argless tool calls is gone. The LogHttp warnings about dead provider ports are gone (assuming you upgraded to 0.6.3). |
| Building UI dashboards | The structured rows[] from claudus_workflow_timeline is the supported API. The Slate widget that was on the roadmap got cut; we'll revisit when there's a verification path that doesn't require visual sign-off. |

Tool count + version​

ClaudusBridge is now at 567+ tools across 50+ categories in 0.6.x. The numbers are live-read from the plugin's Router::GetTotalToolCount() and threaded through every generated artifact (the docs site, the in-editor dashboard, the auto-generated Saved/ClaudusBridge/CLAUDE.md, the /health endpoint, the JSON-RPC serverInfo), so they stay in sync automatically.

The plugin version itself comes from .uplugin VersionName via IPluginManager (ClaudusBridge::GetVersion()). 0.6.3 is the current shipping version.


Where to go next​