
Комментарии 3
Сходил в строительный магазин, купил дрель, пересобрал её дома, смазал как следует…
Всем пока!
Мне тоже пришлось допиливать:
OOM костыли
What is now installed
I installed a vLLM-specific guard:
/etc/systemd/system/vllm-oom-guard.service
/usr/local/sbin/vllm-oom-guard.sh
It watches:
MemAvailable
swap free
memory PSI pressure
kernel journal lines like:
NV_ERR_NO_MEMORY
oom-kill
Killed process
NVRM: Xid
On trigger, it runs:
docker rm -f qwen36-vllm
tmux kill-session -t qwen36eval
pkill -KILL -f 'run-gemma4-vllm.sh|run-qwen36-vllm.sh|VLLM::|vllm...'
It already caught the later 160k Google Gemma failure:
kernel-event NVRM ... NV_ERR_NO_MEMORY
KILL reason=kernel-event
So that part is now working.
About auto-reboot
I also set this as last-resort recovery:
kernel.hung_task_timeout_secs = 120
kernel.hung_task_panic = 1
kernel.panic = 20
Meaning:
if a kernel task is stuck >120s,
panic,
then reboot after 20s
That is not for normal OOM. It is for the case you just hit manually: system wedged so badly that only the power button fixed it.
Тесты:
У меня была задача запустить 2 “лучшие” плотные модели с FP8 квантом, вот что добился:
Dual FP8
Memory/startup breakdown:
Gemma 31B FP8-block
port: 8012
container: dual-gemma-fp8
max_model_len: 150000
gpu_memory_utilization: 0.44
model weights loaded: 31.70 GiB
available KV cache: 17.93 GiB
GPU KV cache size: 290,948 tokens
max concurrency @150k: 1.94x
actual cap: max_num_seqs=1
GPU process memory: ~50.1 GiB
Qwen3.6 27B FP8
port: 8011
container: dual-qwen-fp8
max_model_len: 150000
gpu_memory_utilization: 0.325
model weights loaded: 28.51 GiB
available KV cache: 4.98 GiB
GPU KV cache size: 152,941 tokens
max concurrency @150k: 1.02x
actual cap: max_num_seqs=1
GPU process memory: ~34.4 GiB
Current host/container memory:
dual-gemma-fp8 docker RAM: ~5.37 GiB / 68 GiB
dual-qwen-fp8 docker RAM: ~7.13 GiB / 50 GiB
host RAM used: ~100 GiB
host RAM available: ~20 GiB
Important distinction:
vLLM “max concurrency” means KV capacity for full 150k-token requests.
max_num_seqs=1 means each server is currently limited to 1 active request.
Together, that gives 2 total parallel requests: one Qwen + one Gemma.
Qwen is the tight one: 1.02x, so it cannot safely do 2 full-context requests at 150k.
Context length:
Currently both are served with max_model_len=150000.
Gemma config:
export PATH="$HOME/.local/bin:$PATH"; model-shelf resolve RedHatAI/gemma-4-31B-it-FP8-block --format safetensors && docker run --rm \
--name dual-gemma-fp8 \
--init \
--gpus all \
--ipc=host \
--shm-size=32g \
--memory 68g \
--memory-swap 68g \
--oom-score-adj 900 \
-p 0.0.0.0:8012:8000 \
-v $HOME/.cache/model-shelf/models:/models:ro \
vllm/vllm-openai:nightly-aarch64 \
/models/safetensors/RedHatAI/gemma-4-31B-it-FP8-block \
--served-model-name RedHatAI/gemma-4-31B-it-FP8-block \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 150000 \
--gpu-memory-utilization 0.44 \
--max-num-seqs 1 \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8 \
--async-scheduling \
--enable-prefix-caching \
--enable-chunked-prefill \
--load-format fastsafetensors \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template /vllm-workspace/examples/tool_chat_template_gemma4.jinja \
--limit-mm-per-prompt '{"image": 4, "audio": 1}'
Qwen config:
export PATH="$HOME/.local/bin:$PATH"; model-shelf resolve Qwen/Qwen3.6-27B-FP8 --format safetensors && docker run --rm \
--name dual-qwen-fp8 \
--init \
--gpus all \
--ipc=host \
--shm-size=32g \
--memory 50g \
--memory-swap 50g \
--oom-score-adj 900 \
-p 0.0.0.0:8011:8000 \
-v $HOME/.cache/model-shelf/models:/models:ro \
vllm/vllm-openai:nightly-aarch64 \
/models/safetensors/Qwen/Qwen3.6-27B-FP8 \
--served-model-name Qwen/Qwen3.6-27B-FP8 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--dtype bfloat16 \
--max-model-len 150000 \
--gpu-memory-utilization 0.325 \
--max-num-seqs 1 \
--max-num-batched-tokens 8192 \
--kv-cache-dtype fp8 \
--async-scheduling \
--enable-prefix-caching \
--enable-chunked-prefill \
--load-format fastsafetensors \
--attention-backend flashinfer \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--default-chat-template-kwargs '{"enable_thinking": false}'
Ubuntu 26.04 на клоне DGX Spark (Asus GX10)