Пользователь
Добрый день, делюсь опытом.Стенд: Ryzen 5 5600 + b550m + 32Gb RAM + RTX3060 12GbСреда: Windows 10 (+ wsl2) + Docker + Llama.cpp server cudaЗадача: Hermes-agent + hermes-web-uiСкорость:Prompt processing 2154-2592 t/sГенерация (tg) 49-54 t/sпривожу свой выстраданный docker-compose.yaml :services: llama-cpp-server: image: ghcr.io/ggml-org/llama.cpp:server-cuda container_name: llama-server restart: unless-stopped pull_policy: always ports: - "8080:8080" volumes: - ./models:/models:ro environment: - CUDA_VISIBLE_DEVICES=0 command: # ===== Модель и проектор ===== - "-m" - "/models/Gemma-4-E4B-Claude-Abliterated.Q4_K_M/Gemma-4-E4B-Claude-Abliterated.Q4_K_M.gguf" - "--mmproj" - "/models/Gemma-4-E4B-Claude-Abliterated.Q4_K_M/gemma-4-E4B-it-mmproj-BF16.gguf" - "--jinja" #- "--chat-template" #- "/models/Gemma-4-E4B-Claude-Abliterated.Q4_K_M/chat_template.jinja" # ===== Сеть ===== - "--host" - "0.0.0.0" - "--port" - "8080" # ===== GPU и фиксация в VRAM ===== - "-ngl" - "999" - "--no-mmap" # ===== Контекст и слоты ===== - "-c" - "98304" - "-np" - "2" - "--kv-unified" - "--swa-full" # ===== Потоки CPU ===== - "-t" - "3" # ===== Flash Attention ===== - "-fa" - "on" # ===== Батчинг ===== - "-b" - "4096" - "-ub" - "1024" # ===== KV-квантование ===== #- "-ctk" #- "q8_0" #- "-ctv" #- "q8_0" # ===== Семплирование (Gemma-оптимизированное) ===== - "--temp" - "1.0" - "--top-p" - "0.95" - "--top-k" - "64" # ===== Таймаут ===== - "--timeout" - "600" deploy: resources: reservations: devices: - driver: nvidia count: 1 capabilities: [gpu]пользуйтесь.
Попробуй turboquant4. Или связку mtp + turboquant4.
Добрый день, делюсь опытом.
Стенд: Ryzen 5 5600 + b550m + 32Gb RAM + RTX3060 12Gb
Среда: Windows 10 (+ wsl2) + Docker + Llama.cpp server cuda
Задача: Hermes-agent + hermes-web-ui
Скорость:
Prompt processing
2154-2592 t/s
Генерация (tg)
49-54 t/s
привожу свой выстраданный docker-compose.yaml :
services:
llama-cpp-server:
image: ghcr.io/ggml-org/llama.cpp:server-cuda
container_name: llama-server
restart: unless-stopped
pull_policy: always
ports:
- "8080:8080"
volumes:
- ./models:/models:ro
environment:
- CUDA_VISIBLE_DEVICES=0
command:
# ===== Модель и проектор =====
- "-m"
- "/models/Gemma-4-E4B-Claude-Abliterated.Q4_K_M/Gemma-4-E4B-Claude-Abliterated.Q4_K_M.gguf"
- "--mmproj"
- "/models/Gemma-4-E4B-Claude-Abliterated.Q4_K_M/gemma-4-E4B-it-mmproj-BF16.gguf"
- "--jinja"
#- "--chat-template"
#- "/models/Gemma-4-E4B-Claude-Abliterated.Q4_K_M/chat_template.jinja"
# ===== Сеть =====
- "--host"
- "0.0.0.0"
- "--port"
- "8080"
# ===== GPU и фиксация в VRAM =====
- "-ngl"
- "999"
- "--no-mmap"
# ===== Контекст и слоты =====
- "-c"
- "98304"
- "-np"
- "2"
- "--kv-unified"
- "--swa-full"
# ===== Потоки CPU =====
- "-t"
- "3"
# ===== Flash Attention =====
- "-fa"
- "on"
# ===== Батчинг =====
- "-b"
- "4096"
- "-ub"
- "1024"
# ===== KV-квантование =====
#- "-ctk"
#- "q8_0"
#- "-ctv"
#- "q8_0"
# ===== Семплирование (Gemma-оптимизированное) =====
- "--temp"
- "1.0"
- "--top-p"
- "0.95"
- "--top-k"
- "64"
# ===== Таймаут =====
- "--timeout"
- "600"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
пользуйтесь.
Попробуй turboquant4. Или связку mtp + turboquant4.