Friday, February 6, 2026

How to Update LLM Weights with No Downtime


Imagine trying to renovate the foundation of a towering skyscraper without asking its occupants to leave or pause their work. That’s exactly what MoonshotAI’s Checkpoint Engine does for AI models. It allows large language models to update their brains, the weights, while still running, so there’s no downtime. This lets developers improve their models quickly and efficiently, even on models with over a trillion parameters running on thousands of GPUs. It is fast, reliable, and designed to keep AI systems running smoothly while they evolve in real time, which makes it a useful tool for cutting-edge AI applications. This article goes over what it is, how it works, and why it matters for the future of large-scale AI systems.

What is Moonshot AI’s Checkpoint Engine?

Moonshot AI’s Checkpoint Engine is a specialized middleware designed to update the weights of large language models (LLMs) in real time during inference without interrupting ongoing operations. This capability is critical in reinforcement learning scenarios, where model weights must be updated frequently. The Checkpoint Engine currently integrates with the vLLM inference framework and offers optimized performance through pipelining and memory management techniques. It also provides features such as reusing weights from existing instances to reduce overhead in scaling scenarios.

Architecture

The core of the Checkpoint Engine is the ParameterServer class, which handles the weight update logic and orchestrates the data flow.

  1. H2D (Host to Device): Moves updated weights from CPU memory or storage to GPU memory, using optimized transfer pipelines.
  2. Broadcast: Distributes the weights across all inference engine instances efficiently, leveraging CUDA IPC buffers for shared memory communication.
  3. Reload: Each inference engine then selectively reloads the relevant weight shards from the broadcast data according to its sharding pattern.

This three-stage pipeline overlaps communication and copying, so one bucket of weights can be transferred while another is being broadcast or reloaded.

When GPU memory is limited, the system can fall back to serial execution to maintain reliability.
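The overlap between the three stages, and the serial fallback, can be illustrated with a toy sketch. This is not the ParameterServer implementation: the stage functions `h2d`, `broadcast`, and `reload` below are hypothetical stand-ins that only pass labels around, and `enough_gpu_memory` is an assumed flag. The point is just that each stage runs in its own worker joined by bounded queues, so one bucket can be copied while the previous one is broadcast.

```python
import queue
import threading


def h2d(bucket):
    # Hypothetical stand-in for a host-to-device copy of one bucket.
    return ("on_gpu", bucket)


def broadcast(staged):
    # Hypothetical stand-in for sharing the staged bucket with all engines.
    return ("shared", staged[1])


def reload(shared, log):
    # Hypothetical stand-in for an engine reloading its shard from the bucket.
    log.append(shared[1])


def run_serial(buckets):
    """Process one bucket at a time through all three stages."""
    log = []
    for bucket in buckets:
        reload(broadcast(h2d(bucket)), log)
    return log


def run_pipelined(buckets, depth=2):
    """Run each stage in its own thread, connected by bounded FIFO queues."""
    log = []
    q1, q2 = queue.Queue(maxsize=depth), queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def stage_h2d():
        for bucket in buckets:
            q1.put(h2d(bucket))
        q1.put(done)

    def stage_broadcast():
        while (item := q1.get()) is not done:
            q2.put(broadcast(item))
        q2.put(done)

    def stage_reload():
        while (item := q2.get()) is not done:
            reload(item, log)

    threads = [
        threading.Thread(target=fn)
        for fn in (stage_h2d, stage_broadcast, stage_reload)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


def update_weights(buckets, enough_gpu_memory):
    # Assumed policy: pipeline when memory allows, otherwise go serial.
    return run_pipelined(buckets) if enough_gpu_memory else run_serial(buckets)


if __name__ == "__main__":
    print(update_weights(["bucket0", "bucket1", "bucket2"], enough_gpu_memory=True))
```

The bounded queues are what make the memory trade-off visible: a larger `depth` keeps more buckets in flight at once, which is the kind of extra buffering a memory-constrained machine cannot afford.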

Methods Used

The Checkpoint Engine uses two main methods to update model weights during inference.

  1. Broadcast Method: This is the fastest and the default approach. It is ideal when a large number of inference instances need to be updated simultaneously. It broadcasts the updated weights from CPU memory to all inference GPUs synchronously, ensuring all instances stay in sync with minimal delay.
  2. P2P (Peer-to-Peer) Method: This is used when inference instances are added or removed dynamically at runtime. It avoids disrupting existing inference workloads by sending weights directly from CPUs in existing instances to GPUs in new instances through a peer-to-peer transfer mechanism, allowing smooth and flexible updates.
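As a rough illustration of when each method applies, the sketch below picks a method and the ranks it would target. The function name `plan_update` and its arguments are hypothetical and are not part of checkpoint-engine; the rule it encodes is simply the one described in the list: broadcast to every rank for a synchronized update, P2P only to ranks that joined at runtime.

```python
from typing import Iterable, List, Tuple


def plan_update(all_ranks: Iterable[int], new_ranks: Iterable[int]) -> Tuple[str, List[int]]:
    """Return (method, target ranks) for a weight update.

    Hypothetical helper: if instances joined at runtime, only they receive
    weights (P2P), so existing workloads are left alone. Otherwise every
    rank is updated together (broadcast).
    """
    new = sorted(set(new_ranks))
    if new:
        return "p2p", new
    return "broadcast", sorted(set(all_ranks))


if __name__ == "__main__":
    print(plan_update(range(8), []))              # synchronized update
    print(plan_update(range(16), range(8, 16)))   # eight ranks just joined
```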

Working

The Checkpoint Engine orchestrates the entire transfer process. It first gathers the necessary metadata to create a plan, including deciding the right bucket size for data transfer. Then it executes the transfer, controlling the inference engine through a ZeroMQ socket to maximize performance. It organizes data transfer into pipelines with overlapped communication and copy, enabling fast and efficient weight updates even under heavy workload.

By combining the methods and architecture described above, the Checkpoint Engine enables live weight updates for LLMs across thousands of GPUs with minimal latency and service disruption.
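To make the planning step more concrete, here is a minimal sketch of grouping tensors into transfer buckets from gathered metadata. It assumes a simple greedy policy and a hypothetical metadata shape (a mapping from tensor name to size in bytes); the actual planner in checkpoint-engine may choose bucket sizes differently.

```python
from typing import Dict, List


def plan_buckets(tensor_sizes: Dict[str, int], bucket_bytes: int) -> List[List[str]]:
    """Greedily pack tensors, in order, into buckets of at most bucket_bytes.

    A tensor larger than bucket_bytes gets a bucket of its own.
    """
    buckets: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name, nbytes in tensor_sizes.items():
        # Close the current bucket if this tensor would overflow it.
        if current and used + nbytes > bucket_bytes:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += nbytes
    if current:
        buckets.append(current)
    return buckets


if __name__ == "__main__":
    # Hypothetical tensor names and sizes, for illustration only.
    metas = {"embed": 4, "layer0.attn": 4, "layer0.mlp": 6, "norm": 2}
    print(plan_buckets(metas, bucket_bytes=8))
```

Smaller buckets need less staging memory on the GPU but mean more transfer rounds, which is why the bucket size is decided during planning rather than fixed in advance.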

Installation and Usage

Installation

To use the fastest broadcast implementation:

Use Code:

pip install checkpoint-engine

To use the flexible P2P implementation:

Use Code:

pip install 'checkpoint-engine[p2p]'

This will install mooncake-transfer-engine to support RDMA transfer between different ranks.

Example Use Case

Step 1:

Prepare an H800 or H20 machine with 8 GPUs and the latest vLLM. Be sure to include the /collective_rpc API endpoint commit (available in the main branch), since checkpoint-engine uses this endpoint to update weights.

Step 2:

Install checkpoint-engine.

Code:

uv pip install 'checkpoint-engine[p2p]'

Step 3:

For our use case, we will use Qwen/Qwen3-235B-A22B-Instruct-2507 as the test model.

Code:

hf download Qwen/Qwen3-235B-A22B-Instruct-2507 --local-dir /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/

Step 4:

Start vLLM in dev mode and set --load-format dummy. Make sure to set --worker-extension-cls=checkpoint_engine.worker.VllmColocateWorkerExtension

Code:

VLLM_SERVER_DEV_MODE=1 python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 19730 --trust-remote-code \
    --tensor-parallel-size=8 --max-model-len 4096 --load-format dummy \
    --served-model-name checkpoint-engine-demo --model /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/ \
    --worker-extension-cls checkpoint_engine.worker.VllmColocateWorkerExtension

To update weights with checkpoint-engine, there is no need to wait for vLLM to be ready. Use the command below.

Code:

torchrun --nproc-per-node 8 examples/update.py --update-method all --checkpoint-path /opt/models/Qwen/Qwen3-235B-A22B-Instruct-2507/

Reusing weights from existing instances

New checkpoint-engine instances can join existing instances and reuse their weights.

Use the method below:

Step 1: Start the existing instance with --save-metas-file global_metas.pkl to save the global metas to a file.

Step 2: Use --sleep-time 300 to make sure the existing instances stay alive.

Code:

torchrun --nproc-per-node 8 examples/update.py --checkpoint-path $MODEL_PATH \
    --sleep-time 300 --save-metas-file global_metas.pkl

Step 3: After a checkpoint is registered, new instances can obtain a copy of the checkpoint by setting --load-metas-file global_metas.pkl

Code:

torchrun --nproc-per-node 8 examples/update.py --load-metas-file global_metas.pkl

FP8 quantization

Currently, FP8 quantization does not work natively in vLLM when updating weights. Checkpoint Engine ships a simple patch in patches/vllm_fp8.patch to handle the weight update correctly. This patch has only been tested with DeepSeek-V3.1 and Kimi-K2, so other models may run into compatibility issues.

Test

Run a simple correctness test for checkpoint_engine.

Code:

torchrun --nproc-per-node 8 tests/test_update.py

Benchmark

Model | Device Setup | Metadata Gathering | Update (Broadcast) | Update (P2P)
GLM-4.5-Air (BF16) | 8x H800 TP8 | 0.17 seconds | 3.94 seconds (1.42 GiB) | 8.83 seconds (4.77 GiB)
Qwen3-235B-A22B-Instruct-2507 (BF16) | 8x H800 TP8 | 0.46 seconds | 6.75 seconds (2.69 GiB) | 16.47 seconds (4.05 GiB)
DeepSeek-V3.1 (FP8) | 16x H20 TP16 | 1.44 seconds | 12.22 seconds (2.38 GiB) | 25.77 seconds (3.61 GiB)
Kimi-K2-Instruct (FP8) | 16x H20 TP16 | 1.81 seconds | 15.45 seconds (2.93 GiB) | 36.24 seconds (4.46 GiB)
DeepSeek-V3.1 (FP8) | 256x H20 TP16 | 1.40 seconds | 13.88 seconds (2.54 GiB) | 33.30 seconds (3.86 GiB)
Kimi-K2-Instruct (FP8) | 256x H20 TP16 | 1.88 seconds | 21.50 seconds (2.99 GiB) | 34.49 seconds (4.57 GiB)
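To compare the two update methods directly, the short script below divides each P2P time by the corresponding broadcast time. The timings are copied from the benchmark rows above; the helper name `p2p_slowdowns` is just for illustration.

```python
# (model, device setup, broadcast seconds, P2P seconds), copied from the benchmark.
BENCHMARKS = [
    ("GLM-4.5-Air (BF16)", "8x H800 TP8", 3.94, 8.83),
    ("Qwen3-235B-A22B-Instruct-2507 (BF16)", "8x H800 TP8", 6.75, 16.47),
    ("DeepSeek-V3.1 (FP8)", "16x H20 TP16", 12.22, 25.77),
    ("Kimi-K2-Instruct (FP8)", "16x H20 TP16", 15.45, 36.24),
    ("DeepSeek-V3.1 (FP8)", "256x H20 TP16", 13.88, 33.30),
    ("Kimi-K2-Instruct (FP8)", "256x H20 TP16", 21.50, 34.49),
]


def p2p_slowdowns(rows):
    """Return a list of (model, setup, P2P time divided by broadcast time)."""
    return [(model, setup, p2p / bcast) for model, setup, bcast, p2p in rows]


if __name__ == "__main__":
    for model, setup, ratio in p2p_slowdowns(BENCHMARKS):
        print(f"{model} on {setup}: P2P takes {ratio:.2f}x the broadcast time")
```

In every row the ratio is above 1, meaning broadcast is the faster method in all of the reported configurations.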

Insights

Here are a few observations that I’ve made:

  1. The broadcast method is the faster of the two in every benchmarked configuration, and it is optimized for synchronous weight updates across many inference instances.
  2. The P2P method takes longer but enables dynamic updates when instances join or leave at runtime.
  3. These benchmarks show the scalability of Checkpoint Engine, which handles trillion-parameter models efficiently on clusters ranging from 8 to 256 GPUs.

Limitations of Checkpoint Engine

While Checkpoint Engine is a powerful solution for live weight updates in LLMs, it currently has some limitations.

  • Works Best with vLLM for Now: The engine is mainly tested with the vLLM framework. If you’re hoping to use it with other AI frameworks or custom setups, you may need some extra work to get it running smoothly.
  • Pipeline Still Improving: The ideal seamless pipeline that overlaps data movement perfectly isn’t fully finished yet. This means there’s still potential to make updates even faster.
  • P2P Update Could Be Smoother: The peer-to-peer method sends data through a bottleneck at one main node before sharing it with others, which can slow things down when you have a lot of GPUs.
  • Needs Extra GPU Memory: The broadcast system uses more GPU memory to speed things up. On machines with less memory, it switches to a slower, less efficient process.
  • Limited Support for FP8 Models: If you’re working with the newer FP8 quantized models, you’ll need some experimental patches. Even then, not all models work well yet beyond the few tested ones.

Conclusion

Moonshot AI’s Checkpoint Engine is a game-changer for updating huge AI models without stopping them. It keeps everything running smoothly, even while the model’s “brain” is getting smarter in real time. While it still has a few areas to improve, the potential is huge. If you’re working with large AI systems, this tool is definitely worth watching. It’s helping make the future of AI faster and more efficient, without any downtime.

Frequently Asked Questions

Q1. What problem does Checkpoint Engine solve?

A. It lets large language models update weights in real time during inference without downtime, so AI systems stay online while improving.

Q2. Which frameworks does Checkpoint Engine support?

A. Right now, it’s primarily integrated and tested with the vLLM inference framework.

Q3. What’s the difference between the Broadcast and P2P methods?

A. Broadcast is faster for synchronized updates across many GPUs, while P2P allows flexible updates when instances join or leave.

I’m a Data Science Trainee at Analytics Vidhya, passionately working on the development of advanced AI solutions such as Generative AI applications, Large Language Models, and cutting-edge AI tools that push the boundaries of technology. My role also involves creating engaging educational content for Analytics Vidhya’s YouTube channels, developing comprehensive courses that cover the full spectrum of machine learning to generative AI, and authoring technical blogs that connect foundational concepts with the latest innovations in AI. Through this, I aim to contribute to building intelligent systems and share knowledge that inspires and empowers the AI community.
