
In this article, we will not only install a local (and free) alternative to ChatGPT, but also review the most important open LLMs, delve into the advanced settings of LM Studio, connect the chatbot to Visual Studio Code, and teach it to assist us with programming. We will also look at how to fine-tune the model's behavior using system prompts.
An LLM (Large Language Model) is a generative neural network trained on vast amounts of text. It can understand queries, engage in dialogue, and generate coherent text based on a given context. In common parlance, it's a 'chatbot' (although this word existed long before the advent of neural networks).
Why?
After all, there's ChatGPT, Claude, DeepSeek, Gemini...
In fact, there are many reasons to host a chatbot on your own computer. Here are just a few:
Privacy. Not a single byte of data leaves your machine for third-party servers. This is especially important if we are working with sensitive or confidential information: finances, medicine, corporate projects. For example, several Samsung engineers recently accidentally uploaded confidential source code to ChatGPT — that is, to the servers of a private company, OpenAI!
If they had read this article, they would have simply installed LM Studio and avoided a reprimand from their boss (or being fired).
No censorship or restrictions. Almost all cloud-based LLMs have strict filters and moderation. There are topics they will simply refuse to discuss with you — be it technical details, politics, security, or even philosophy. Yes, restrictions can sometimes be bypassed with clever 'prompt engineering,' but there is no complete freedom in the cloud — it poses risks for businesses, which will always prefer to play it safe.
Support for different models. In the cloud, you can only interact with the models provided by the service. Locally, however, we can run any open LLM suitable for a specific task: Mistral for speed, LLaMA3 for response quality, DeepSeek-Coder or CodeGemma as a coding assistant.
Integration into projects — we can integrate our own model into a Telegram bot, our AI startup, or a coding assistant in an IDE. Even if our project will run on a cloud-based LLM in production, it's better to test locally.
Training and customization. In the cloud, you can't fine-tune proprietary models like GPT-4o or Claude — not even for money. The most you can do is customize them using a system prompt or an 'instructional' communication style. Locally, however, we can perform fine-tuning, connect RAG, and adjust the model's style and behavior, giving us full control over the process.
Free of charge. Any cloud service either requires a subscription or limits the number of tokens per day or per month. With a local LLM, we are only limited by our computer's resources. And why pay for a Cursor subscription when you can set up a local coding assistant in Visual Studio Code for free?
Any downsides?
Of course, there are. It's not always possible to run the same model that works in the cloud:
We might not have enough hardware resources for the full version of the model and will have to use a lightweight one (for example, the cloud version of DeepSeek has 685 billion parameters, whereas my RTX 4070 Ti Super already lags with a 32 billion parameter model). And without at least 16 GB of RAM, it's a hopeless endeavor from the start.
Some models, in addition to the reason above, are simply not publicly available — such as ChatGPT-4o, Claude 3, and Gemini 1.5.
Because of the two points above, we have to run lightweight versions of the models. They are faster and lighter, but:
are less accurate
may give more 'flat' answers
don't always handle complex tasks as well as GPT-4o or Claude
Of course, if we have a cluster of server-grade GPUs, we can run the much-hyped DeepSeek-685B* without compromise — but most users will have to settle for lighter models.
*The number before 'b' — for example, 685b — indicates how many billions of parameters this version of the model has. The larger the number, the better the model reasons, but the more demanding it is on hardware. The sweet spot for standard consumer hardware with a GPU is considered to be 16–22b.
What hardware is needed for an LLM?
Although running local models is possible even on a laptop, the user experience heavily depends on the configuration.
Minimum requirements for running:
RAM: from 16 GB, preferably 32 GB
GPU: any with 6–8 GB VRAM, for example, RTX 3060 / 4060
Apple M1/M2/M3 (16–24 GB RAM)
What you can run: models up to 7B parameters (Q4/K_M)
Good options:
MacBook Pro M1/M2/M3 with 16+ GB RAM
PC with RTX 3060 / 4060 / RX 7600
Optimal level (without lags):
RAM: 32–64 GB
GPU: RTX 4070 / 4070 Ti / 4070 Ti Super / RX 7900 XT
What you can run comfortably: Models up to 13B–22B parameters (including DeepSeek-Coder-6.7B and LLaMA 13B)
This setup allows you to:
Work in an IDE and run the model in parallel
Use the assistant in 'near real-time' mode
Enthusiast or development under load:
RAM: from 64 GB
GPU: RTX 4090 (24 GB VRAM) or A6000 / H100
Models: up to 33B–70B, including Mixtral, DeepSeek-Coder-33B
On such machines, you can:
Run benchmarks, RAG, and fine-tuning
Use models on par with ChatGPT-3.5 in terms of quality and speed
tl;dr
≤ 9 b — laptops with RTX 4060 / MacBook M1 16 GB, real-time
9 – 22 b — RTX 4070/7900 XT, <1 s/token
22 – 70 b — RTX 4090 24 GB or A6000, 'workable' speed
70 b + MoE — a single RTX 4090 can handle it (20 B active), but 2×GPU is better
> 200b — only multi-GPU or a cluster (H100, A100)
| Model | Parameters | GPU | Estimated Speed |
|---|---|---|---|
| DeepSeek 685B | 685 billion | Cluster with 8× H100 (80 GB) | ~real-time |
| DeepSeek-Coder 33B | 33 billion | RTX Pro 6000 | ~real-time |
| DeepSeek-Coder 33B | 33 billion | RTX 4070 Ti Super | extremely slow |
| DeepSeek-Coder 6.7B | 6.7 billion | RTX 4070 Ti Super | almost instant |
LM Studio
LM Studio is one of the most user-friendly desktop applications for running local LLMs.
More experienced users might prefer Ollama — it's more flexible and better suited for automation, but it doesn't have a graphical interface 'out of the box' (though one can be connected separately). For most tasks involving language models, LM Studio is more than sufficient — especially since both programs use the same engine under the hood: llama.cpp.
As of this writing, LM Studio can:
Provide a ChatGPT-like interface for dialogue with the model. Dialogues can be duplicated, and messages can be arbitrarily deleted and edited — in short, much more freedom than in ChatGPT.
Discover models with previews — you can find language models directly in the LM Studio window and even filter for models that suit your hardware. You can also download models from HuggingFace.
Download and switch between language models in one click.
Configure the system prompt. This allows you to define the model's 'personality': communication style, role, tone, and behavior.
Act as a local server with an OpenAI-compatible API. You can connect the model to a Telegram bot, use it in third-party applications, or use the model as an engine for an AI assistant in an IDE.
Change generation parameters — top_p, top_k, and others. More on this below.
Connect to MCP servers.
RAG — allows you to upload PDF documents and have a dialogue based on their content. Large documents will be indexed like a classic RAG, while smaller documents will be loaded entirely into the context.
First launch
LM Studio is available on Mac, Windows (incl. Arm), and Linux, and installation requires no special steps. Just go here, choose your platform, and install.
After installation, we see the start window:

By default, the interface is set to User mode, but we're all adults here, so let's switch to Developer mode right away:

Next, click on 'Select a model to load' and LM Studio will thoughtfully suggest gemma-3 in a build suitable for our hardware:

Waiting for the 6–8 GB LLM model to download...

Download, chat, PROFIT!
Can we end the tutorial now? Not so fast.
Models
LM Studio allows us to download models in two ways — through its own marketplace (the purple magnifying glass button) or through external sites, like HuggingFace.

In the built-in marketplace, models with reasoning, image recognition, and tool-use support are conveniently labeled.
And now let's take a break from LM Studio itself and look at the main open LLMs. There are base models: LLaMA, Mistral, Gemma, Qwen, DeepSeek, and their fine-tuned versions specializing in more 'playful' communication, coding, censorship removal, and specific communication scenarios.
Quantization (Q)
In model names, besides the size (e.g., 24b), we often see suffixes like Q4_K_M. This means the model is quantized — compressed with some loss of quality, like a JPEG, but for neural networks instead of images.
All models available for download via LM Studio are already quantized, which allows them to run on standard consumer hardware without server-grade GPUs.
Quantization is a trade-off between accuracy and performance: the model takes up less memory and runs faster, but may lose a bit of quality.
If you want to dive into the technical details, I have a separate article on quantization.
For now, it's enough to remember:
the higher the number after Q, the more accurate the model, but the harder it is to run
Q8 preserves the most quality but requires more VRAM
Q2 and Q3 are too lossy; the optimal compromise is Q4_K_M or Q5_K_M
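As a rough rule of thumb, a quantized GGUF file weighs about (parameters × bits per weight) bytes. Here is a sketch using approximate bits-per-weight figures for common quantization levels (real files add a small overhead for metadata and the tokenizer):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Rough GGUF file size in gigabytes: parameters × bits per weight.

    Approximate effective bits per weight:
      Q4_K_M ≈ 4.8, Q5_K_M ≈ 5.7, Q8_0 ≈ 8.5
    """
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# A 7B model at Q4_K_M fits comfortably in 8 GB of VRAM:
print(round(quantized_size_gb(7, 4.8), 1))   # ≈ 4.2 GB
```

The same arithmetic explains the tables above: a 33B model at Q4 is already around 20 GB, which is why it crawls on a 16 GB card.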
Base LLM Models
LLaMA (Meta*)
The new LLaMA 4 lineup was released in spring 2025 and includes the LLaMA 4 Scout and Maverick versions (both MoE models with 17B active parameters; roughly 109B and 400B total, respectively). These are the most powerful open-weight LLMs from Meta to date, with output quality approaching that of GPT-4. Even Scout performs confidently in reasoning tasks, and Maverick surpasses ChatGPT-3.5.
LLaMA models are the most popular for fine-tuning and custom builds. However, Meta's license restricts commercial use, especially in products that compete with Meta's own services (e.g., chatbots and assistants).
Gemma (Google)
A lightweight open-source version from Google, based on Gemini developments. It performs quite well even on low-end hardware and is easy to fine-tune. It is distributed under Google's relatively permissive Gemma license; however, Google reserves the right to terminate usage upon suspicion of violating its rules. This restriction also applies to derivative builds.
Qwen (Alibaba)
The current Qwen 3 lineup shows excellent results in benchmarks, especially in programming, math, and multilingual reasoning tasks. Both powerful MoE models (e.g., 235B) and compact versions from 0.5B are available — including builds for ARM and systems without GPUs.
The models are distributed under the open Apache 2.0 license, but some weight classes (especially large MoE models) may have restrictions on use in China and in cloud products, which should be considered for commercial application.
DeepSeek (DeepSeek AI)
The very same DeepSeek that made waves in early 2025. Today, both general-purpose language models (from 1.3B to 236B parameters in an MoE architecture) and specialized DeepSeek-Coder V2/V3 models for programming are available.
Of particular interest to us is DeepSeek-Coder V2–33B, which demonstrates quality comparable to GPT-4 in coding tasks (according to HumanEval++ and other benchmarks).
Below is a brief table with the main characteristics of these models:
| Model | Developer | Strengths | Weaknesses |
|---|---|---|---|
| LLaMA 4 Scout / Maverick (17b active) | Meta | High quality, powerful base for fine-tuning, rich ecosystem | License restricts commercial use |
| Gemma 3 (1b / 4b / 12b / 27b) | Google | Multimodal (text + image), 128k long context, 140+ languages | License has restrictions; base 1b version lacks vision |
| Mistral Small 3.1 / Devstral-24B | Mistral AI | Context up to 128k, powerful reasoning ability | Requires a lot of VRAM |
| Mixtral 8×22B-Instruct | Mistral AI | MoE, high performance, 128k context | High hardware requirements |
| Qwen 3 (0.6–32b, 235b MoE) | Alibaba | Good at code and math, multilingual, 128k long context, Apache 2.0 | Filters for 'critical' content remain; resource-intensive |
| DeepSeek Coder V2/V3 (~21–37b active) | DeepSeek AI | MoE, expert in coding and code analysis | Very demanding on resources and settings |
| StarCoder 2 (7b / 15b) | Hugging Face / BigCode | Optimized for code, >100k long context, excellent for dev scenarios | Not intended for general dialogue |
| Phi-3 Mini / Small / Med | Microsoft | Compact, CPU-friendly, up to 128k context | Limited in complex reasoning |
| DBRX (132b, 36b active) | Databricks / MosaicML | MoE, good at code/math, >100k long context | Requires a lot of VRAM; small community for now |
| Command-R+ (35b) | Cohere | Optimized for RAG, structured JSON output, 200k context | Needs ≥24 GB VRAM at 35b; less flexible as a chat assistant |
My subjective selection of models
for conversation:
LLaMA 3 8B Instruct
Nous-Hermes-2-LLaMA3-8B-GGUF
openchat-4
Gemma 2-9B-Instruct (lightweight for low-end systems)
for coding:
StarCoder2-15B
Mixtral-8×7B-Instruct-v0.1
deepseek-coder-6.7B-Instruct
For role-playing / no censorship:
MythoMax-L2
dolphin-2.7-mixtral-8×7b
RAG / API:
Command-R+
DBRX
LM Studio Settings
Now that we have downloaded the models we are interested in, we can manage them (view and delete) through the My Models menu (red folder):

LM Studio gives us access to a whole range of parameters that directly affect the model's behavior and response style. If you want the assistant to be serious or, conversely, playful, or to have certain blocks needed for our project — this can be done in a couple of clicks.

System Context (System Prompt)
This is an introductory instruction that defines the model's 'personality.' Example: 'You are a technical assistant. Answer briefly and strictly to the point, without unnecessary fluff or disclaimers.' The System Context acts as a basic behavioral firmware — everything the model says will pass through this prism.
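When driving the model through LM Studio's API rather than the GUI, the same instruction travels as the first message with the `system` role. A minimal sketch (the prompt text is only an illustration):

```python
def build_messages(system_prompt: str, user_question: str) -> list[dict]:
    """The system prompt rides along as the first message of every request;
    the model treats it as its 'behavioral firmware' for the whole dialogue."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question},
    ]

messages = build_messages(
    "You are a technical assistant. Answer briefly and strictly to the point.",
    "What does top_k do?",
)
```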

Model Parameters

Temperature — controls the model's 'creativity.' At a low value (0.2–0.5), responses will be precise, concise, and almost template-like — well suited for technical support or brief instructions. At a high value (0.8–1.2), the model starts to 'fantasize': it more often chooses less probable words, producing livelier, more unconventional, and more creative text.
Top-k and Top-p (Nucleus Sampling) — both parameters control how many text continuation options the model considers for each token.
Top-k limits the choice: if k = 40, the model chooses from the 40 most probable words.
Top-p defines a 'probability threshold': if p = 0.9, words that collectively reach 90% probability are considered. By lowering these values, we make the responses more predictable; by increasing them, we give more room for creativity.
Repeat Penalty — helps combat model looping or phrase repetition. A value of 1.1–1.2 is considered a good starting point: it doesn't prevent the model from completing sentences normally but keeps it from getting stuck on the same phrases. If the model writes 'yes-yes-yes' or 'here's an example, example, example' — you should increase this setting.
Max Tokens — directly limits the length of the response. Useful if you need a short explanation, not a wall of text. If the model 'gets carried away' and writes more than necessary, we set a limit, for example, 200 or 512 tokens.
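To see how these knobs fit together, here is a sketch of a request to LM Studio's OpenAI-compatible endpoint. The model name and URL are assumptions for illustration; note that top_k and repeat_penalty are llama.cpp-style extensions rather than official OpenAI fields, so support may vary between servers:

```python
import json
import urllib.request

def build_request(prompt: str) -> dict:
    """Build a chat-completion payload with explicit sampling settings.

    temperature / top_p / max_tokens are standard OpenAI fields; top_k and
    repeat_penalty are llama.cpp extensions that LM Studio also understands.
    """
    return {
        "model": "local-model",          # LM Studio serves whichever model is loaded
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,              # precise, almost template-like answers
        "top_p": 0.9,                    # nucleus-sampling probability threshold
        "top_k": 40,                     # consider only the 40 most probable tokens
        "repeat_penalty": 1.1,           # discourage loops like "yes-yes-yes"
        "max_tokens": 512,               # cap the response length
    }

def ask(prompt: str, url: str = "http://localhost:1234/v1/chat/completions") -> str:
    """Send the payload to a running LM Studio server and return the reply text."""
    req = urllib.request.Request(
        url,
        data=json.dumps(build_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```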
Structured Output — this is when the model responds not just with text, but strictly in a specific format:
JSON
YAML
Markdown table
Formatted code
In LM Studio, you can explicitly ask the model to adhere to a format (e.g., JSON) or to respond according to a template (e.g., {"question": "…", "answer": "…"}). This works with a well-thought-out prompt or an instruction in the System Context. It is especially useful if the responses will go to a Telegram bot, an API, a database, or an IDE. Here is an example of such a prompt:
You are a financial analyst. Respond strictly in JSON format:
{"recommendation": "string", "reason": "string"}
Since this feature relies entirely on the model's intelligence, some models handle the JSON format better than others.
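Because of that, it helps to validate replies before passing them downstream — models sometimes wrap their JSON in Markdown fences or drop a field. A minimal sketch, with hypothetical keys matching the financial-analyst prompt above:

```python
import json

def parse_structured_reply(reply: str, required_keys: set) -> dict:
    """Parse a model reply that was asked to answer in strict JSON.

    Strips a surrounding ```json fence if present; raises ValueError
    when parsing fails or a required key is missing.
    """
    text = reply.strip()
    if text.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence.
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    data = json.loads(text)
    missing = required_keys - data.keys()
    if missing:
        raise ValueError(f"model omitted keys: {missing}")
    return data

# A reply wrapped in a fence still parses cleanly:
reply = '```json\n{"recommendation": "hold", "reason": "high volatility"}\n```'
print(parse_structured_reply(reply, {"recommendation", "reason"}))
```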
Local API Server
In addition to its GUI, LM Studio can function as a local server fully compatible with the OpenAI API standard. This means that any application that works with LLMs via HTTP requests can use a local model through LM Studio.
Here are typical scenarios:
Connecting to a Telegram bot
Integration into your own web application or CLI
Working in an IDE via plugins (e.g., Continue for VS Code)
Even if we plan to use a paid model like ChatGPT or Claude in the final production version, it's more convenient (and free) to connect to local LLMs during the development stage.
To do this, go to the Developer tab (green console) and enable the server. The default server address is: http://localhost:1234/v1

Coding Assistant
Now let's move on to another practical use of the API server — connecting a coding assistant. This is not a full-fledged guide to vibe-coding, so we will only briefly look at how to connect LM Studio to Continue — a wrapper plugin for integrating LLMs into Visual Studio Code.
Install the Continue plugin from the Marketplace.
In LM Studio, enable Developer Mode and start the API server. A message about the server starting should appear in the console.

In the Continue settings, find Models → + New Assistant. In the config.yaml that opens, add the model settings:

Example settings. The model name must match the exact ID in LM Studio.
```yaml
name: Local Assistant
version: 1.0.0
schema: v1
models:
  - name: Qwen LM Studio
    provider: openai
    model: qwen/qwen2.5-coder-14b
    apiBase: http://localhost:1234/v1
    apiKey: ""
    roles:
      - chat
      - edit
      - apply
context:
  - provider: code
  - provider: docs
  - provider: diff
  - provider: terminal
  - provider: problems
  - provider: folder
  - provider: codebase
```
Now our code assistant works locally — and for free.

And if you're one of those Samsung engineers who previously sent confidential code to an external server — now your boss will be pleased with you!
In future tutorials, we will look at Ollama and more extensive configuration of AI assistants for coding.