Comments 3

Went to the hardware store, bought a drill, rebuilt it at home, greased it properly…

Bye, everyone!

I had to do some extra tinkering too:

OOM kludges

What is now installed

I installed a vLLM-specific guard:

  /etc/systemd/system/vllm-oom-guard.service
  /usr/local/sbin/vllm-oom-guard.sh

It watches:

  • MemAvailable

  • swap free

  • memory PSI pressure

  • kernel journal lines like:

    • NV_ERR_NO_MEMORY

    • oom-kill

    • Killed process

    • NVRM: Xid

On trigger, it runs:

  docker rm -f qwen36-vllm
  tmux kill-session -t qwen36eval
  pkill -KILL -f 'run-gemma4-vllm.sh|run-qwen36-vllm.sh|VLLM::|vllm...'
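The checks listed above come down to a few file reads and one pattern match. Below is a minimal sketch of that logic, not the contents of the installed vllm-oom-guard.sh: the function names, the environment variables, and all threshold values are assumptions made up for illustration.

```shell
# Hypothetical helpers; names and thresholds are illustrative only.

# Print one field (in kB) from a /proc/meminfo-style file.
meminfo_kb() {
  awk -v key="$1:" '$1 == key { print $2; exit }' "${2:-/proc/meminfo}"
}

# Print the "some avg10" percentage from a /proc/pressure/memory-style file.
psi_some_avg10() {
  awk '$1 == "some" { sub("avg10=", "", $2); print $2; exit }' \
    "${1:-/proc/pressure/memory}"
}

# Succeed if a kernel log line matches one of the watched patterns.
is_kill_line() {
  printf '%s\n' "$1" |
    grep -Eq 'NV_ERR_NO_MEMORY|oom-kill|Killed process|NVRM: Xid'
}

# Succeed (meaning "trigger") when any reading crosses its threshold.
# Arguments: MemAvailable kB, SwapFree kB, PSI some avg10.
should_trigger() {
  [ "$1" -lt "${MIN_AVAIL_KB:-4194304}" ] && return 0   # 4 GiB, assumed
  [ "$2" -lt "${MIN_SWAP_FREE_KB:-0}" ] && return 0     # 0 disables the check
  awk -v p="$3" -v m="${MAX_PSI_AVG10:-20}" 'BEGIN { exit !(p > m) }'
}
```

A polling loop would call `should_trigger "$(meminfo_kb MemAvailable)" "$(meminfo_kb SwapFree)" "$(psi_some_avg10)"` every few seconds, while a second loop could feed lines from `journalctl -k -f` into `is_kill_line`.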

It already caught the later 160k Google Gemma failure:

  kernel-event NVRM ... NV_ERR_NO_MEMORY
  KILL reason=kernel-event

So that part is now working.

About auto-reboot

I also set this as last-resort recovery:

  kernel.hung_task_timeout_secs = 120
  kernel.hung_task_panic = 1
  kernel.panic = 20

Meaning:

  if a kernel task is stuck >120s,
  panic,
  then reboot after 20s

That is not for normal OOM. It is for the case you just hit manually: system wedged so badly that only the power button fixed it.
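To make these three settings survive a reboot they can go into a sysctl drop-in file. A small sketch follows; the function name and the file name `99-hung-task-reboot.conf` are assumptions, not something taken from the setup above.

```shell
# Write the three settings to a sysctl drop-in file at the given path.
write_hung_task_conf() {
  cat > "$1" <<'EOF'
# Flag tasks blocked for more than 120 s and panic when one is found.
kernel.hung_task_timeout_secs = 120
kernel.hung_task_panic = 1
# Reboot 20 s after a kernel panic.
kernel.panic = 20
EOF
}

# Example usage (needs root; the target file name is hypothetical):
#   write_hung_task_conf /tmp/hung-task.conf
#   sudo install -m 0644 /tmp/hung-task.conf /etc/sysctl.d/99-hung-task-reboot.conf
#   sudo sysctl --system
```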

Tests:

My goal was to run the two “best” dense models with FP8 quantization; here is what I got:

Dual FP8

Memory/startup breakdown:

  Gemma 31B FP8-block
  port: 8012
  container: dual-gemma-fp8

  max_model_len: 150000
  gpu_memory_utilization: 0.44
  model weights loaded: 31.70 GiB
  available KV cache: 17.93 GiB
  GPU KV cache size: 290,948 tokens
  max concurrency @150k: 1.94x
  actual cap: max_num_seqs=1
  GPU process memory: ~50.1 GiB

  Qwen3.6 27B FP8
  port: 8011
  container: dual-qwen-fp8

  max_model_len: 150000
  gpu_memory_utilization: 0.325
  model weights loaded: 28.51 GiB
  available KV cache: 4.98 GiB
  GPU KV cache size: 152,941 tokens
  max concurrency @150k: 1.02x
  actual cap: max_num_seqs=1
  GPU process memory: ~34.4 GiB
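Dividing the available KV cache by the reported token capacity gives a rough KV cost per token for each server. This is plain arithmetic on the figures above, not a number vLLM reports, and the function name is made up for the example.

```shell
# KiB of KV cache per token: (GiB * 1024 * 1024) / tokens, one decimal place.
kv_kib_per_token() {
  awk -v gib="$1" -v tokens="$2" \
    'BEGIN { printf "%.1f\n", gib * 1024 * 1024 / tokens }'
}

kv_kib_per_token 17.93 290948   # Gemma: prints 64.6
kv_kib_per_token 4.98 152941    # Qwen:  prints 34.1
```

By this estimate the Qwen server spends roughly half as much KV memory per token as the Gemma server, which is why its much smaller 4.98 GiB pool still covers one full 150k-token request.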

Current host/container memory:

  dual-gemma-fp8 docker RAM: ~5.37 GiB / 68 GiB
  dual-qwen-fp8  docker RAM: ~7.13 GiB / 50 GiB
  host RAM used: ~100 GiB
  host RAM available: ~20 GiB

Important distinction:

  • vLLM “max concurrency” means KV capacity for full 150k-token requests.

  • max_num_seqs=1 means each server is currently limited to 1 active request.

  • Together, that gives 2 total parallel requests: one Qwen + one Gemma.

  • Qwen is the tight one: 1.02x, so it cannot safely do 2 full-context requests at 150k.
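The reported “max concurrency” figures match a simple ratio of KV cache size in tokens to max_model_len. A quick check, assuming that is how the number is derived (the helper name is made up for the example):

```shell
# KV cache capacity in tokens divided by max_model_len, two decimal places.
kv_concurrency() {
  awk -v kv="$1" -v len="$2" 'BEGIN { printf "%.2f\n", kv / len }'
}

kv_concurrency 290948 150000   # Gemma: prints 1.94
kv_concurrency 152941 150000   # Qwen:  prints 1.02
```

Anything below 2.00 means a second full-length request would not fit in the KV cache, so max_num_seqs=1 is the safe setting for both servers at this context length.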

Context length:

  Currently both are served with max_model_len=150000.

  • Gemma config:

export PATH="$HOME/.local/bin:$PATH"; model-shelf resolve RedHatAI/gemma-4-31B-it-FP8-block --format safetensors && docker run --rm \
  --name dual-gemma-fp8 \
  --init \
  --gpus all \
  --ipc=host \
  --shm-size=32g \
  --memory 68g \
  --memory-swap 68g \
  --oom-score-adj 900 \
  -p 0.0.0.0:8012:8000 \
  -v "$HOME/.cache/model-shelf/models:/models:ro" \
  vllm/vllm-openai:nightly-aarch64 \
  /models/safetensors/RedHatAI/gemma-4-31B-it-FP8-block \
  --served-model-name RedHatAI/gemma-4-31B-it-FP8-block \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --max-model-len 150000 \
  --gpu-memory-utilization 0.44 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \
  --kv-cache-dtype fp8 \
  --async-scheduling \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --load-format fastsafetensors \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template /vllm-workspace/examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image": 4, "audio": 1}'

  • Qwen config:

export PATH="$HOME/.local/bin:$PATH"; model-shelf resolve Qwen/Qwen3.6-27B-FP8 --format safetensors && docker run --rm \
  --name dual-qwen-fp8 \
  --init \
  --gpus all \
  --ipc=host \
  --shm-size=32g \
  --memory 50g \
  --memory-swap 50g \
  --oom-score-adj 900 \
  -p 0.0.0.0:8011:8000 \
  -v "$HOME/.cache/model-shelf/models:/models:ro" \
  vllm/vllm-openai:nightly-aarch64 \
  /models/safetensors/Qwen/Qwen3.6-27B-FP8 \
  --served-model-name Qwen/Qwen3.6-27B-FP8 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --dtype bfloat16 \
  --max-model-len 150000 \
  --gpu-memory-utilization 0.325 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 8192 \
  --kv-cache-dtype fp8 \
  --async-scheduling \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --load-format fastsafetensors \
  --attention-backend flashinfer \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --default-chat-template-kwargs '{"enable_thinking": false}'
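
Once both containers are up, a quick way to confirm that each one answers is to query the OpenAI-compatible `/v1/models` endpoint on its published port. A small sketch; the helper names are made up, and the host defaults to 127.0.0.1 on the assumption that the check runs on the same machine.

```shell
# Build the model-list URL for a port (and optional host).
models_url() {
  printf 'http://%s:%s/v1/models\n' "${2:-127.0.0.1}" "$1"
}

# Fetch the model list; fails with a non-zero status if the server is down.
list_models() {
  curl -fsS "$(models_url "$@")"
}

# Example usage:
#   list_models 8012   # Gemma
#   list_models 8011   # Qwen
```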