MLX to vLLM migration validated

Move the mind,
not just the model.

PermeantOS is an open-source research-preview hypervisor for live AI agent state migration. It extracts active KV cache state, moves it through USXF, and attaches it to a target runtime so an agent can resume without replaying its full context.

Current validated result

The validated path runs from a local Apple Silicon host (MLX) to an AWS g4dn.xlarge instance running vLLM 0.23.0, using Qwen/Qwen2.5-0.5B-Instruct. Hash and slot probes passed, 16 vLLM prefix-cache blocks were seeded, and the post-migration continuation matched the source continuation exactly over a 16-token validation horizon.

2016: validated prefix tokens
24: Qwen layers migrated
0.0: slot-probe max diff

Diagram: Permeant migration node (USXF v1.1). SOURCE: Apple Silicon MLX (live source adapter, active KV). Signed USXF stream. TARGET: AWS vLLM CUDA (prefix-cache attachment, seeded).

Interactive migration trace

Live Migration Simulator

Step through the same handoff PermeantOS performs in the validated MLX-to-vLLM path: freeze, encode, stream, seed, verify, commit.
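The six steps named above form a strict sequence, which can be sketched as a small ordered state machine. This is an illustrative sketch only: the `Step` and `MigrationTrace` names are hypothetical and are not taken from the PermeantOS codebase; only the step names and their order come from the text.

```python
from enum import Enum


class Step(Enum):
    # Step names and order follow the handoff described in the text.
    FREEZE = 1
    ENCODE = 2
    STREAM = 3
    SEED = 4
    VERIFY = 5
    COMMIT = 6


class MigrationTrace:
    """Hypothetical tracker that only accepts steps in their defined order."""

    def __init__(self):
        self.completed = []

    def advance(self, step):
        done = len(self.completed)
        # After COMMIT there is no further step to expect.
        expected = Step(done + 1) if done < len(Step) else None
        if step is not expected:
            raise ValueError(f"expected {expected}, got {step}")
        self.completed.append(step)

    @property
    def committed(self):
        return bool(self.completed) and self.completed[-1] is Step.COMMIT


def run_all():
    """Walk every step in order and return the finished trace."""
    trace = MigrationTrace()
    for step in Step:  # Enum iteration follows definition order
        trace.advance(step)
    return trace
```

Rejecting out-of-order steps keeps a target from committing state that was never verified, which is the point of placing verify before commit.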

Simulator view: SOURCE HOST (Apple Silicon / MLX), USXF sealed frame (0% encoded), TARGET HOST (AWS g4dn / vLLM), and a Permeant daemon telemetry panel (MLX-TO-VLLM-X64).

USXF v1.1

A runtime-neutral envelope for active KV state.

USXF stores model identity, attention geometry, sequence length, dtype, transfer quantization, block hashes, checksums, and signatures so source and target runtimes can reason about migratable state without sharing a hardware layout.

The current implementation focuses on live KV-cache migration. Full Agent Memory Graph state is planned but not yet implemented.

{
  "usxf_version": "1.1",
  "source_runtime": "mlx",
  "target_runtime": "vllm-0.23.0",
  "model_architecture": "Qwen2.5-0.5B-Instruct",
  "sequence_length": 2016,
  "layers": 24,
  "block_hashes": ["sha256:..."],
  "signature": "ed25519:..."
}

A compact USXF envelope view: identity, runtime geometry, hashes, and signature in one migratable contract.
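A receiver could run a structural check on such an envelope before touching any tensor data. The sketch below assumes only the field names shown in the envelope view above; the `check_envelope` helper is hypothetical, it is not the USXF reference validator, and it does not verify the signature cryptographically.

```python
import json

# Field names and types mirror the compact envelope view; nothing else is assumed.
REQUIRED_FIELDS = {
    "usxf_version": str,
    "source_runtime": str,
    "target_runtime": str,
    "model_architecture": str,
    "sequence_length": int,
    "layers": int,
    "block_hashes": list,
    "signature": str,
}


def check_envelope(raw):
    """Return a list of structural problems in an envelope (empty if none)."""
    envelope = json.loads(raw)
    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in envelope:
            problems.append(f"missing field: {name}")
        elif not isinstance(envelope[name], expected_type):
            problems.append(f"wrong type for {name}")

    # Scheme prefixes are checked only when the field has the expected type.
    hashes = envelope.get("block_hashes")
    if isinstance(hashes, list):
        for h in hashes:
            if not str(h).startswith("sha256:"):
                problems.append(f"unexpected hash scheme: {h}")
    signature = envelope.get("signature")
    if isinstance(signature, str) and not signature.startswith("ed25519:"):
        problems.append("unexpected signature scheme")
    return problems
```

Returning a list of problems rather than raising on the first one lets a daemon log every mismatch in a single pass.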

Evidence, not theater

Validated result and analytical break-even.

The strongest result is deliberately narrow: one real MLX-to-vLLM migration, exact short-horizon continuation fidelity, and measured tensor integrity at the target.

Real runtime

MLX to vLLM

Local Apple Silicon source, AWS g4dn.xlarge target, vLLM 0.23.0, Qwen2.5-0.5B-Instruct.

Validation

Exact 16-token horizon

Hash validation passed, sampled key/value tensors matched with max diff 0.0, and post-migration decode matched source continuation.
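The two checks described here, a per-block hash comparison and a sampled max-difference probe, can be sketched in a few lines. This is a minimal illustration under stated assumptions: blocks are plain lists of floats, and hashing the little-endian float32 encoding is a choice made for this example, not the documented USXF hashing scheme. The function names are hypothetical.

```python
import hashlib
import struct


def block_hash(values):
    """SHA-256 over the little-endian float32 encoding of one block (assumed encoding)."""
    payload = struct.pack(f"<{len(values)}f", *values)
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def max_abs_diff(source, target):
    """Largest element-wise absolute difference between two equal-length slots."""
    if len(source) != len(target):
        raise ValueError("slot shapes differ")
    return max((abs(s - t) for s, t in zip(source, target)), default=0.0)


def probe(source_blocks, target_blocks):
    """Return (hashes_match, worst_diff) across paired source and target blocks."""
    pairs = list(zip(source_blocks, target_blocks))
    same_count = len(source_blocks) == len(target_blocks)
    hashes_match = same_count and all(block_hash(s) == block_hash(t) for s, t in pairs)
    worst = max((max_abs_diff(s, t) for s, t in pairs), default=0.0)
    return hashes_match, worst
```

A result of `(True, 0.0)` corresponds to the outcome reported above: hashes agree and no sampled element differs.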

Analysis

32k+ break-even

The paper estimates migration becomes favorable at long contexts, especially at 25 Gbps with FP8 transfer quantization.
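To see why link speed and transfer quantization matter, the ideal wire time for a KV cache can be estimated from its size. The sketch below is not the paper's break-even model: it covers raw transfer time only and ignores freeze, encode, verification, and host warmup. The 24 layers come from this page; the 2 KV heads and head dimension of 64 are assumptions about the Qwen2.5-0.5B geometry and should be checked against the model config.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bytes_per_element):
    """Size of keys plus values for one sequence (the factor 2 covers K and V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_element


def wire_seconds(num_bytes, link_gbps):
    """Ideal transfer time over a link rated in gigabits per second."""
    return num_bytes * 8 / (link_gbps * 1e9)


# Assumed geometry: 24 layers (from the page), 2 KV heads and head_dim 64 (assumption).
fp16 = kv_cache_bytes(32768, layers=24, kv_heads=2, head_dim=64, bytes_per_element=2)
fp8 = kv_cache_bytes(32768, layers=24, kv_heads=2, head_dim=64, bytes_per_element=1)

print(f"FP16: {fp16 / 1e6:.1f} MB, {wire_seconds(fp16, 25):.3f} s at 25 Gbps")
print(f"FP8:  {fp8 / 1e6:.1f} MB, {wire_seconds(fp8, 25):.3f} s at 25 Gbps")
```

Halving the bytes per element halves the wire time, which is why FP8 transfer quantization shifts the comparison in favor of migration.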

Latency Savings Calculator

Conversation Context Length: 32768 tokens

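Uses the whitepaper prefill model T = alpha*S + beta*S^2, where S is the context length in tokens and T is the estimated re-prefill time in seconds, with alpha=1e-5 and beta=6e-10. This is an analytical estimate, not a measured end-to-end runtime.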

Analytical re-prefill estimate: 0.97s

Validated live migration result: 2016-token prefix

Analytical comparison

Long contexts are where state migration starts to matter. Actual cutover time depends on model geometry, bandwidth, quantization, runtime hooks, and host warmup.
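The calculator's prefill model is easy to evaluate directly. The sketch below uses only the formula and constants quoted in the calculator text; the function name and the sample context lengths are illustrative choices.

```python
ALPHA = 1e-5   # linear coefficient, from the calculator text
BETA = 6e-10   # quadratic coefficient, from the calculator text


def reprefill_seconds(context_tokens, alpha=ALPHA, beta=BETA):
    """Analytical re-prefill estimate: T = alpha*S + beta*S^2."""
    return alpha * context_tokens + beta * context_tokens ** 2


for tokens in (2016, 8192, 32768, 131072):
    print(f"{tokens:>7} tokens -> {reprefill_seconds(tokens):.2f} s")
```

At 32768 tokens this reproduces the 0.97 s figure shown in the calculator. The quadratic term dominates as context grows, so the re-prefill estimate rises faster than linearly with context length.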

Architecture

How PermeantOS works.

The daemon and adapters coordinate extraction, normalization, transport, target allocation, verification, and commit.

Roadmap

Toward Agent Memory Graph migration.

KV cache portability is the first layer. The next major milestone is transactional migration of conversation state, tool calls, artifacts, retrieval memory, provenance, and pending work.

KV cache migration

Validated MLX source extraction, USXF transport, target KV write, prefix-cache seeding, and exact short-horizon continuation.

Validated

Broader runtime fidelity

Repeat validation across longer horizons, larger contexts, more model families, quantized transfers, and prewarmed cloud images.

In progress

Full Agent Memory Graphs

Serialize graph state containing turns, tools, artifacts, retrieval bindings, system checkpoints, and token-span mappings.

Planned
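Since the Agent Memory Graph is planned rather than implemented, no schema exists yet. Purely as a hypothetical illustration of what serializing the listed components might look like, the sketch below groups them in a dataclass and round-trips it through JSON. Every class and field name here is invented for the example and should not be read as the project's design.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TokenSpan:
    node_id: str   # hypothetical: graph node the span belongs to
    start: int     # first token index (inclusive)
    end: int       # last token index (exclusive)


@dataclass
class AgentMemoryGraph:
    # One list per component named in the roadmap text.
    turns: list = field(default_factory=list)
    tools: list = field(default_factory=list)
    artifacts: list = field(default_factory=list)
    retrieval_bindings: list = field(default_factory=list)
    system_checkpoints: list = field(default_factory=list)
    token_spans: list = field(default_factory=list)

    def to_json(self):
        # asdict converts nested dataclasses such as TokenSpan into plain dicts.
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw):
        data = json.loads(raw)
        data["token_spans"] = [TokenSpan(**s) for s in data["token_spans"]]
        return cls(**data)
```

Token-span mappings are kept as explicit records because they are what would tie graph nodes back to positions in a migrated KV cache.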

Apache 2.0

Contribute to portable long-running intelligence.

PermeantOS is open-source infrastructure research. Contributions are especially welcome around runtime adapters, manifest analysis, Agent Memory Graph schema design, reproducible benchmarks, security review, and documentation.

Technical whitepaper, June 2026

PermeantOS White Paper: Live migration for AI agent state

A technical walk-through of USXF, the live MLX-to-vLLM validation, and the roadmap toward full agent-state migration.
