
In this article, we will not only install a local (and free) alternative to ChatGPT, but also review the most important open LLMs, delve into the advanced settings of LM Studio, connect the chatbot to Visual Studio Code, and teach it to assist us with programming. We will also look at how to fine-tune the model's behavior using system prompts.
An LLM (Large Language Model) is a generative neural network trained on vast amounts of text. It can understand queries, engage in dialogue, and generate coherent text based on a given context. In common parlance, it's a 'chatbot' (although this word existed long before the advent of neural networks).
Why?
After all, there's ChatGPT, Claude, DeepSeek, Gemini...
In fact, there are many reasons to host a chatbot on your own computer. Here are just a few:
Privacy. Not a single byte of data leaves your machine for third-party servers. This is especially important if we are working with sensitive or confidential information: finances, medicine, corporate projects. For example, several Samsung engineers recently accidentally uploaded confidential source code to ChatGPT — that is, to the servers of a private company, OpenAI!
If they had read this article, they would have simply installed LM Studio and avoided a reprimand from their boss (or being fired).
No censorship or restrictions. Almost all cloud-based LLMs have strict filters and moderation. There are topics they will simply refuse to discuss with you — be it technical details, politics, security, or even philosophy. Yes, restrictions can sometimes be bypassed with clever 'prompt engineering,' but there is no complete freedom in the cloud — it poses risks for businesses, which will always prefer to play it safe.
Support for different models. In the cloud, you can only interact with the models provided by the service. Locally, however, we can run any open LLM suitable for a specific task: Mistral for speed, LLaMA3 for response quality, DeepSeek-Coder or CodeGemma as a coding assistant.
Integration into projects — we can integrate our own model into a Telegram bot, our AI startup, or a coding assistant in an IDE. Even if our project will run on a cloud-based LLM in production, it's better to test locally.
Training and customization. In the cloud, you can't fine-tune proprietary models like GPT-4o or Claude — not even for money. The most you can do is customize them using a system prompt or an 'instructional' communication style. Locally, however, we can perform fine-tuning, connect RAG, and adjust the model's style and behavior, giving us full control over the process.
Free of charge. Any cloud service either requires a subscription or limits the number of tokens per day or per month. With a local LLM, we are only limited by our computer's resources. And why pay for a Cursor subscription when you can set up a local coding assistant in Visual Studio Code for free?
Any downsides?
Of course, there are. It's not always possible to run the same model that works in the cloud:
We might not have enough hardware resources for the full version of the model and will have to use a lightweight one (for example, the cloud version of DeepSeek has 685 billion parameters, whereas my RTX 4070 Ti Super already lags with a 32 billion parameter model). And without at least 16 GB of RAM, it's a hopeless endeavor from the start.
Some models, in addition to the reason above, are simply not publicly available — such as ChatGPT-4o, Claude 3, and Gemini 1.5.
Because of the two points above, we have to run lightweight versions of the models. They are faster and lighter, but:
are less accurate
may give more 'flat' answers
don't always handle complex tasks as well as GPT-4o or Claude
Of course, if we have a cluster of server-grade GPUs, we can run the much-hyped DeepSeek-685B* without compromise — but most users will have to settle for lighter models.
*The number before 'b' — for example, 685b — indicates how many billions of parameters this version of the model has. The larger the number, the better the model reasons, but the more demanding it is on hardware. The sweet spot for standard consumer hardware with a GPU is considered to be 16–22b.
What hardware is needed for an LLM?
Although running local models is possible even on a laptop, the user experience heavily depends on the configuration.
Minimum requirements for running:
RAM: from 16 GB, preferably 32 GB
GPU: any with 6–8 GB VRAM, for example, RTX 3060 / 4060
Apple M1/M2/M3 (16–24 GB RAM)
What you can run: models up to 7B parameters (Q4/K_M)
Good options:
MacBook Pro M1/M2/M3 with 16+ GB RAM
PC with RTX 3060 / 4060 / RX 7600
Optimal level (without lags):
RAM: 32–64 GB
GPU: RTX 4070 / 4070 Ti / 4070 Ti Super / RX 7900 XT
What you can run comfortably: Models up to 13B–22B parameters (including DeepSeek-Coder-6.7B and LLaMA 13B)
This setup allows you to:
Work in an IDE and run the model in parallel
Use the assistant in 'near real-time' mode
Enthusiast or development under load:
RAM: from 64 GB
GPU: RTX 4090 (24 GB VRAM) or A6000 / H100
Models: up to 33B–70B, including Mixtral, DeepSeek-Coder-33B
On such machines, you can:
Run benchmarks, RAG, and fine-tuning
Use models on par with ChatGPT-3.5 in terms of quality and speed
tl;dr
≤ 9 b — laptops with RTX 4060 / MacBook M1 16 GB, real-time
9 – 22 b — RTX 4070/7900 XT, <1 s/token
22 – 70 b — RTX 4090 24 GB or A6000, 'workable' speed
70 b + MoE — a single RTX 4090 can handle it (20 B active), but 2×GPU is better
> 200b — only multi-GPU or a cluster (H100, A100)
| Model | Parameters | GPU | Estimated Speed |
|---|---|---|---|
| DeepSeek 685B | 685 billion | Cluster with 8× H100 (80 GB) | ~real-time |
| DeepSeek-Coder 33B | 33 billion | RTX Pro 6000 | ~real-time |
| DeepSeek-Coder 33B | 33 billion | RTX 4070 Ti Super | extremely slow |
| DeepSeek-Coder 6.7B | 6.7 billion | RTX 4070 Ti Super | almost instant |
LM Studio
LM Studio is one of the most user-friendly desktop applications for running local LLMs.
More experienced users might prefer Ollama — it's more flexible and better suited for automation, but it doesn't have a graphical interface 'out of the box' (though one can be connected separately). For most tasks involving language models, LM Studio is more than sufficient — especially since both programs use the same engine under the hood: llama.cpp.
As of this writing, LM Studio can:
Provide a ChatGPT-like interface for dialogue with the model. Dialogues can be duplicated, and messages can be arbitrarily deleted and edited — in short, much more freedom than in ChatGPT.
Discover models with previews — you can find language models directly in the LM Studio window and even filter for models that suit your hardware. You can also download models from HuggingFace.
Download and switch between language models in one click.
Configure the system prompt. This allows you to define the model's 'personality': communication style, role, tone, and behavior.
Act as a local server with an OpenAI-compatible API. You can connect the model to a Telegram bot, use it in third-party applications, or use the model as an engine for an AI assistant in an IDE.
Change generation parameters — top_p, top_k, and others. More on this below.
Connect to MCP servers.
RAG — allows you to upload PDF documents and have a dialogue based on their content. Large documents will be indexed like a classic RAG, while smaller documents will be loaded entirely into the context.
First launch
LM Studio is available on Mac, Windows (incl. Arm), and Linux, and installation requires no special steps. Just go here, choose your platform, and install.
After installation, we see the start window:

By default, the interface is set to User mode, but we're all adults here, so let's switch to Developer mode right away:

Next, click on 'Select a model to load' and LM Studio will thoughtfully suggest gemma-3 in a build suitable for our hardware:

Waiting for the 6–8 GB LLM model to download...

Download, chat, PROFIT!
Can we end the tutorial now? Not so fast.
Models
LM Studio allows us to download models in two ways — through its own marketplace (the purple magnifying glass button) or through external sites, like HuggingFace.

In the built-in marketplace, models with reasoning, image recognition, and tool-use support are conveniently labeled.
And now let's take a break from LM Studio itself and look at the main open LLMs. There are base models: LLaMA, Mistral, Gemma, Qwen, DeepSeek, and their fine-tuned versions specializing in more 'playful' communication, coding, censorship removal, and specific communication scenarios.
Quantization (Q)
In model names, besides the size (e.g., 24b), we often see suffixes like Q4_K_M. This means the model is quantized — compressed with some loss of quality, like a JPEG, but for neural networks instead of images.
All models available for download via LM Studio are already quantized, which allows them to run on standard consumer hardware without server-grade GPUs.
Quantization is a trade-off between accuracy and performance: the model takes up less memory and runs faster, but may lose a bit of quality.
If you want to dive into the technical details, I have a separate article on quantization.
For now, it's enough to remember:
the higher the number after Q, the more accurate the model, but the harder it is to run
Q8 preserves the most quality but requires more VRAM
Q2 and Q3 are too lossy; the optimal compromise is Q4_K_M or Q5_K_M
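As a rough rule of thumb, a quantized GGUF file weighs about (parameters × bits per weight) bytes. Here is a sketch using approximate bits-per-weight figures for common quantization levels (real files add a small overhead for metadata and the tokenizer):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size in gigabytes: parameters × bits per weight.

    Approximate effective bits per weight:
      Q4_K_M ≈ 4.8, Q5_K_M ≈ 5.7, Q8_0 ≈ 8.5
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model at Q4_K_M fits comfortably in 8 GB of VRAM:
print(round(quantized_size_gb(7, 4.8), 1))   # ≈ 4.2 GB
```

The same arithmetic explains the tables above: a 33B model at Q4 is already around 20 GB, which is why it crawls on a 16 GB card.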
Base LLM Models
LLaMA (Meta*)
The new LLaMA 4 lineup was released in spring 2025 and includes the LLaMA 4 Scout and Maverick versions (both MoE models with 17B active parameters; roughly 109B and 400B total, respectively). These are the most powerful open-weight LLMs from Meta to date, with output quality approaching that of GPT-4. Even Scout performs confidently in reasoning tasks, and Maverick surpasses ChatGPT-3.5.
LLaMA models are the most popular for fine-tuning and custom builds. However, Meta's license restricts commercial use, especially in products that compete with Meta's own services (e.g., chatbots and assistants).
Gemma (Google)
A lightweight open-source version from Google, based on Gemini developments. It performs quite well even on low-end hardware and is easy to fine-tune. It is distributed under Google's relatively permissive Gemma license; however, Google reserves the right to terminate usage upon suspicion of violating its rules. This restriction also applies to derivative builds.
Qwen (Alibaba)
The current Qwen 3 lineup shows excellent results in benchmarks, especially in programming, math, and multilingual reasoning tasks. Both powerful MoE models (e.g., 235B) and compact versions from 0.5B are available — including builds for ARM and systems without GPUs.
The models are distributed under the open Apache 2.0 license, but some weight classes (especially large MoE models) may have restrictions on use in China and in cloud products, which should be considered for commercial application.
DeepSeek (DeepSeek AI)
The very same DeepSeek that made waves in early 2025. Today, both general-purpose language models (from 1.3B to 236B parameters in an MoE architecture) and specialized DeepSeek-Coder V2/V3 models for programming are available.
Of particular interest to us is DeepSeek-Coder V2–33B, which demonstrates quality comparable to GPT-4 in coding tasks (according to HumanEval++ and other benchmarks).
Below is a brief table with the main characteristics of these models:
| Model | Developer | Strengths | Weaknesses |
|---|---|---|---|
| LLaMA 4 Scout / Maverick (17b active) | Meta | High quality, powerful base for fine-tuning, rich ecosystem | License restricts commercial use |
| Gemma 3 (1b / 4b / 12b / 27b) | Google | Multimodal (text + image), 128k long context, 140+ languages | License has restrictions; base 1b version lacks vision |
| Mistral Small 3.1 / Devstral-24B | Mistral AI | Context up to 128k, powerful reasoning ability | Requires a lot of VRAM |
| Mixtral 8×22B-Instruct | Mistral AI | MoE, high performance, 128k context | High hardware requirements |
| Qwen 3 (0.6–32b, 235b MoE) | Alibaba | Good at code and math, multilingual, 128k long context, Apache 2.0 | Filters for 'critical' content remain; resource-intensive |
| DeepSeek Coder V2/V3 (~21–37b active) | DeepSeek AI | MoE, expert in coding and code analysis | Very demanding on resources and settings |
| StarCoder 2 (7b / 15b) | Hugging Face / BigCode | Optimized for code, >100k long context, excellent for dev scenarios | Not intended for general dialogue |
| Phi-3 Mini / Small / Med | Microsoft | Compact, CPU-friendly, up to 128k context | Limited in complex reasoning |
| DBRX (132b, 36b active) | Databricks / MosaicML | MoE, good at code/math, >100k long context | Requires a lot of VRAM; small community for now |
| Command-R+ (35b) | Cohere | Optimized for RAG, structured JSON output, 200k context | Needs ≥24 GB VRAM at 35b; less flexible as a chat assistant |
My subjective selection of models
for conversation:
LLaMA 3 8B Instruct
Nous-Hermes-2-LLaMA3-8B-GGUF
openchat-4
Gemma 2-9B-Instruct (lightweight for low-end systems)
for coding:
StarCoder2-15B
Mixtral-8×7B-Instruct-v0.1
deepseek-coder-6.7B-Instruct
For role-playing / no censorship:
MythoMax-L2
dolphin-2.7-mixtral-8×7b
RAG / API:
Command-R+
DBRX
LM Studio Settings
Now that we have downloaded the models we are interested in, we can manage them (view and delete) through the My Models menu (red folder):

LM Studio gives us access to a whole range of parameters that directly affect the model's behavior and response style. If you want the assistant to be serious or, conversely, playful, or to have certain blocks needed for our project — this can be done in a couple of clicks.

System Context (System Prompt)
This is an introductory instruction that defines the model's 'personality.' Example: 'You are a technical assistant. Answer briefly and strictly to the point, without unnecessary fluff or disclaimers.' The System Context acts as a basic behavioral firmware — everything the model says will pass through this prism.
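When driving the model through LM Studio's API rather than the GUI, the same instruction travels as the first message with the `system` role. A minimal sketch (the prompt text is only an illustration):

```python
def build_messages(system_prompt: str, user_question: str) -> list[dict]:
    """The system prompt rides along as the first message of every request;
    the model treats it as its 'behavioral firmware' for the whole dialogue."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question},
    ]

messages = build_messages(
    "You are a technical assistant. Answer briefly and strictly to the point.",
    "What does top_k do?",
)
```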

Model Parameters

Temperature — controls the model's 'creativity.' At a low value (0.2–0.5), responses will be precise, concise, and almost template-like — well suited for technical support or brief instructions. At a high value (0.8–1.2), the model starts to 'fantasize': it more often chooses less probable words, producing livelier, more unconventional, and more creative text.
Top-k and Top-p (Nucleus Sampling) — both parameters control how many text continuation options the model considers for each token.
Top-k limits the choice: if k = 40, the model chooses from the 40 most probable words.
Top-p defines a 'probability threshold': if p = 0.9, words that collectively reach 90% probability are considered. By lowering these values, we make the responses more predictable; by increasing them, we give more room for creativity.
Repeat Penalty — helps combat model looping or phrase repetition. A value of 1.1–1.2 is considered a good starting point: it doesn't prevent the model from completing sentences normally but keeps it from getting stuck on the same phrases. If the model writes 'yes-yes-yes' or 'here's an example, example, example' — you should increase this setting.
Max Tokens — directly limits the length of the response. Useful if you need a short explanation, not a wall of text. If the model 'gets carried away' and writes more than necessary, we set a limit, for example, 200 or 512 tokens.
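To see how these knobs fit together, here is a sketch of a request to LM Studio's OpenAI-compatible endpoint. The model name and URL are assumptions for illustration; note that top_k and repeat_penalty are llama.cpp-style extensions rather than official OpenAI fields, so support may vary between servers:

```python
import json
import urllib.request

def build_request(prompt: str) -> dict:
    """Build a chat-completion payload with explicit sampling settings.

    temperature / top_p / max_tokens are standard OpenAI fields; top_k and
    repeat_penalty are llama.cpp extensions that LM Studio also understands.
    """
    return {
        "model": "local-model",          # LM Studio serves whichever model is loaded
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,              # precise, almost template-like answers
        "top_p": 0.9,                    # nucleus-sampling probability threshold
        "top_k": 40,                     # consider only the 40 most probable tokens
        "repeat_penalty": 1.1,           # discourage loops like "yes-yes-yes"
        "max_tokens": 512,               # cap the response length
    }

def ask(prompt: str, url: str = "http://localhost:1234/v1/chat/completions") -> str:
    """Send the payload to a running LM Studio server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```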
Structured Output — this is when the model responds not just with text, but strictly in a specific format:
JSON
YAML
Markdown table
Formatted code
In LM Studio, you can explicitly ask the model to adhere to a format (e.g., JSON) or to respond according to a template (e.g., {"question": "…", "answer": "…"}). This works with a well-thought-out prompt or an instruction in the System Context. It is especially useful if the responses will go to a Telegram bot, an API, a database, or an IDE. Here is an example of such a prompt:
You are a financial analyst. Respond strictly in JSON format:
{"recommendation": "string", "reason": "string"}
Since this feature relies entirely on the model's intelligence, some models handle the JSON format better than others.
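Because of that, it helps to validate replies before passing them downstream — models sometimes wrap their JSON in Markdown fences or drop a field. A minimal sketch, with hypothetical keys matching the financial-analyst prompt above:

```python
import json

def parse_structured_reply(reply: str, required_keys: set) -> dict:
    """Parse a model reply that was asked to answer in strict JSON.

    Strips a surrounding ```json fence if present; raises ValueError
    when parsing fails or a required key is missing.
    """
    text = reply.strip()
    if text.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(text)
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"model omitted keys: {missing}")
    return data

# A reply wrapped in a fence still parses cleanly:
reply = '```json\n{"recommendation": "hold", "reason": "high volatility"}\n```'
print(parse_structured_reply(reply, {"recommendation", "reason"}))
```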
Local API Server
In addition to its GUI, LM Studio can function as a local server fully compatible with the OpenAI API standard. This means that any application that works with LLMs via HTTP requests can use a local model through LM Studio.
Here are typical scenarios:
Connecting to a Telegram bot
Integration into your own web application or CLI
Working in an IDE via plugins (e.g., Continue for VS Code)
Even if we plan to use a paid model like ChatGPT or Claude in the final production version, it's more convenient (and free) to connect to local LLMs during the development stage.
To do this, go to the Developer tab (green console) and enable the server. The default server address is: http://localhost:1234/v1

Coding Assistant
Now let's move on to another practical use of the API server — connecting a coding assistant. This is not a full-fledged guide to vibe-coding, so we will only briefly look at how to connect LM Studio to Continue — a wrapper plugin for integrating LLMs into Visual Studio Code.
Install the Continue plugin from the Marketplace.
In LM Studio, enable Developer Mode and start the API server. A message about the server starting should appear in the console.

In the Continue settings, find Models → + New Assistant. In the config.yaml that opens, add the model settings:

Example settings. The model name must match the exact ID in LM Studio.
```yaml
name: Local Assistant
version: 1.0.0
schema: v1
models:
  - name: Qwen LM Studio
    provider: openai
    model: qwen/qwen2.5-coder-14b
    apiBase: http://localhost:1234/v1
    apiKey: ""
    roles:
      - chat
      - edit
      - apply
context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
```
Now our code assistant works locally — and for free.

And if you're one of those Samsung engineers who previously sent confidential code to an external server — now your boss will be pleased with you!
In future tutorials, we will look at Ollama and more extensive configuration of AI assistants for coding.