When we talk about using large language models (LLMs), most people immediately think of cloud services. But it's not always convenient or possible to work over the internet: sometimes privacy restrictions get in the way, sometimes it's the connection speed, and sometimes you just want more control over the process. It is for such tasks that Ollama exists — a tool that allows you to run modern language models locally, in just a couple of steps.
With it, you can easily download a model, configure it for your needs, and work with it directly on your computer, without depending on external servers. Ollama provides a simple interface both through the command line and via an API, making it convenient for both developers and those just starting to get acquainted with LLMs.
Installing Ollama
So, how do you install Ollama?
The easiest way is to go to the official Ollama website and click the download button, or go directly to the download page, then select your operating system and install the program.

After installing Ollama, a familiar window appears on the screen, very similar to the ChatGPT interface. You can start working at this stage: writing prompts, getting answers, and experimenting with the model's capabilities. However, I still recommend first deciding which specific model you want to work with. The speed of operation, the quality of the answers, and how comfortable you will be using the system on your device all depend on this. I will discuss the choice of model in more detail in the next chapter.

Choosing a Model
Choosing a model is perhaps the most important step before you start using Ollama to its full potential. Everything here depends on the tasks you want to solve and the resources your computer has.
Graphics Card Capabilities
First and foremost, when choosing a model, you should pay attention to your hardware, specifically your graphics card. The main parameter here is VRAM (Video Random Access Memory). Essentially, it's the same as RAM, but it's dedicated to storing data that the graphics processor works with. And a simple rule applies here: the more, the better.
When you run an LLM, the model's weights are loaded into video memory. That is, if the model weighs 7 GB, it will ideally take up about that much VRAM. If the model doesn't fit entirely, Ollama can offload some of the layers to system RAM and run them on the CPU, but this is noticeably slower. Therefore, owners of cards with 4–6 GB of VRAM should choose more compact options, while those with 16–24 GB or more can run much heavier and higher-quality models.
Model Formats
When you start choosing a model, you will immediately notice suffixes next to the model name, such as 7B, q4, or fp16. In fact, they hold the key: these parameters determine whether the model can run on your computer and what quality of answers you can expect.
Let's start with the letter B. When they write 7B, 13B or 70B, they are talking about the number of parameters in the model—billions of numbers that it consists of. The more parameters, the theoretically "smarter" the model is, the better it understands context and provides detailed answers. But along with this, the hardware requirements also grow: only the most powerful graphics cards can handle a 70B model in its full format.
Now about the q suffix. It stands for quantization. Quantized models (for example, q4 or q8) take up less video memory and run faster. This is because the model's weights are stored not in the usual 16 or 32 bits, but in a more compact form—4 or 8 bits. The price for this is a slight loss of precision: sometimes such models are slightly worse at logic or code generation. But in most everyday scenarios, the difference is almost unnoticeable, and the performance gain is huge.
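The arithmetic behind this is simple: parameter count times bits per weight. Here is a rough back-of-the-envelope estimate (weights only; real memory use is somewhat higher because of the context window and runtime overhead):

```python
def model_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate storage for a model's weights: parameters x bits, in GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at different precisions:
print(model_size_gb(7, 16))  # fp16 -> 14.0 GB
print(model_size_gb(7, 8))   # q8   ->  7.0 GB
print(model_size_gb(7, 4))   # q4   ->  3.5 GB
```

This makes it clear why q4 is the go-to choice for cards with little VRAM: the same 7B model shrinks from 14 GB to about 3.5 GB.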
It's also worth mentioning the formats FP32, FP16, BF16, and INT8. Here we are talking about the format in which the model stores its weights.
FP32 — this is the "gold standard" of precision, but also the heaviest option.
FP16 and BF16 — simplified formats with half precision. They significantly save memory and speed up work, with almost no loss of quality.
INT8 and other integer variants are even lighter and faster, but here the trade-off between speed and precision becomes more noticeable.
In the end, you get a kind of constructor. On one hand, there's the number of parameters (7B, 13B, 70B), and on the other, the storage format (fp16, fp32, int8) and quantization (q4, q8). By choosing a combination of these parameters, we balance the quality of the answers with how many resources our computer is willing to spend.
Types of Models
In reality, there are quite a few types of models, and each is tailored to its own tasks. To generalize, models can be divided into two main groups: general-purpose models and specialized models.
General-Purpose Models
These are the most general neural networks. They are designed to work with text: supporting dialogue, answering questions, writing articles, generating ideas. Examples of such models are LLaMA, Mistral. They are well-suited for most everyday tasks where text-based intelligence is needed.
Specialized Models
There are models tailored for specific scenarios, for example:
CodeLlama — focused on programming and working with code. It is better at autocompletion, explaining algorithms, and fixing errors.
Gemma — a conversational model, similar in style to ChatGPT, excellent for chatbots and interactive communication.
LLaVA — a multimodal model that can work with images, describe pictures, and combine text and visual data.
Turbo, Embedding, Vision, Tools, Thinking
Ollama also uses special tags, so to speak, that denote the features of a model:
Turbo — an optimized version of the model that works faster and with less latency. Ideal if response speed is important.
Embedding — models that create vector representations of text. Used for search, recommendations, and analysis, not for text generation.
Vision — multimodal models capable of working with images: recognizing objects, describing scenes, combining visual and text data.
Tools — models that can interact with external tools: APIs, databases, scripts. Suitable for automating complex tasks.
Thinking — models with an emphasis on reasoning and logic. They handle mathematics, strategic planning, and complex chains of reasoning well.
Each type of model has its own strengths and limitations. Therefore, before choosing, it is important to determine what you need: text generation, code, data analysis, or working with images.
Personal Advice
When it comes to choosing a model for Ollama, I always advise looking first at the VRAM of your graphics card. The amount of video memory determines which models you can run comfortably and which ones only in a reduced or quantized form.
For low-end graphics cards with 4–6 GB of VRAM, I recommend choosing compact quantized models. A good option is 7B q4. It takes up little memory, works quickly, and allows you to experiment with text and dialogues without freezing.
If you have mid-range graphics cards with 8–12 GB of VRAM, you can confidently move on to 7B q8 or 13B q4. These are already "smarter" models that provide more accurate and detailed answers, without requiring super-powerful hardware.
For powerful graphics cards with 16–32 GB of VRAM, the possibilities are wide open. Here you can already use 13B fp16, and if you wish, even larger models like 30B q4/q8. Such models are excellent for complex tasks: code generation, analysis of large texts, or working with multimodal data.
The main thing I want to emphasize is: don't chase the largest number of parameters if your VRAM is limited. It's better to take a smaller model in a suitable format (q4/q8) than to try to run a 30B fp16 on a card with 8 GB: it either won't fit at all or will be painfully slow.
By following this rule, you can use Ollama as effectively as possible: the models will run smoothly, and the quality of the answers will remain decent.
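The rules of thumb above can be condensed into a toy helper. The thresholds are my reading of the tiers in this section, not official Ollama guidance:

```python
def recommended_models(vram_gb: int) -> str:
    """Map available VRAM to the model tiers suggested in the text."""
    if vram_gb <= 6:          # low-end cards: compact quantized models
        return "7B q4"
    if vram_gb <= 12:         # mid-range cards
        return "7B q8 or 13B q4"
    return "13B fp16, or 30B q4/q8"  # 16 GB and up

print(recommended_models(6))   # -> 7B q4
print(recommended_models(10))  # -> 7B q8 or 13B q4
print(recommended_models(24))  # -> 13B fp16, or 30B q4/q8
```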
Setup and Configuration
An important factor when using Ollama is its proper configuration.

The first thing to pay attention to is where the models are downloaded. They weigh a lot: even compact options like 7B q4 take up several gigabytes, and if you want to experiment with 13B or 30B, free space will disappear very quickly. Therefore, if you have a second drive, especially a large HDD or SSD, it's better to configure Ollama right away so that all models are saved there. This will not affect the speed of operation, but your system drive will remain free.
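On Linux and macOS, one way to do this is the OLLAMA_MODELS environment variable, which tells Ollama where to store downloaded models. The path below is just an example; on Windows you would set the same variable through System Environment Variables instead:

```shell
# Create a folder on the larger drive and point Ollama at it (example path)
mkdir -p "$HOME/ollama-models"
export OLLAMA_MODELS="$HOME/ollama-models"
# Make the setting persist across sessions
echo 'export OLLAMA_MODELS="$HOME/ollama-models"' >> ~/.bashrc
```

Note that Ollama reads this variable at startup, so restart the Ollama service after changing it.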
The second important point is the context length. This term refers to the number of tokens that the model can consider in a single request. Simply put, it is the size of the memory within which the model understands what the conversation is about. The larger the context, the better it remembers previous replies, large documents, or long code. But there is a downside: increasing the context length requires more resources and slows down generation.
By default, most models have a limited context length (e.g., 2K or 4K tokens), but in Ollama you can run models with an extended context—8K, 16K, and even more. Here again, it all comes down to your graphics card: the larger the context, the more VRAM will be required.
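In Ollama, the context window is controlled by the num_ctx parameter, and one common way to raise it is a Modelfile. The model and tag names below are examples, and this sketch assumes the base model has already been pulled:

```shell
# Build a variant of mistral with an 8K context window (names are examples)
cat > Modelfile <<'EOF'
FROM mistral
PARAMETER num_ctx 8192
EOF
ollama create mistral-8k -f Modelfile
ollama run mistral-8k
```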
Therefore, configuring Ollama comes down to two simple rules: store models on a separate drive and choose a reasonable context length for your tasks and resources. This way, you will get the most out of a local LLM without overloading your computer with unnecessary tasks.
Working via CLI
From this point on, the article contains information for more advanced users.
You can work with Ollama not only through the client application with its user-friendly interface but also through the command line. This method is especially convenient for developers and those who are used to managing tools via the terminal. In addition, the CLI allows you to quickly check the model's operation without extra windows and switching.
To see all available commands, just type ollama in the command line, and a list of commands with descriptions will appear. Let's go through them.

List of commands:
ollama serve — starts Ollama as a service. This command is needed if you want to work via the API or use Ollama in conjunction with other applications.
ollama create <name> — creates a new model based on an existing one. For example, you can add your own instructions or fine-tune it for a specific task.
ollama show <model> — shows detailed information about a model: its size, parameters, and settings.
ollama run <model> — runs a model in interactive mode. The most popular command: after calling it, you can immediately communicate with the LLM directly in the terminal.
ollama stop <model> — stops a running model, freeing up resources.
ollama pull <model> — downloads a model from the repository. This is usually the first step before running: the model won't start without being downloaded.
ollama push <model> — pushes a model to the repository. Useful if you have made your own build and want to share it or use it on another device.
ollama list — shows all the models installed on your computer.
ollama ps — lists the running models. Convenient for seeing what is currently running.
ollama cp <source> <destination> — copies a model from one place to another, for example, for backup or transfer.
ollama rm <model> — deletes a model, freeing up disk space. Especially relevant given that models can weigh tens of gigabytes.
ollama help — displays help for all commands. A good reminder if you suddenly forget the syntax or name.
In practice, the most commonly used commands are run, pull, list, and stop. The other commands are useful when you are working with custom models, setting up a server, or managing a library of neural networks on your drive.
Working via API
If the command line is not enough for you and you want to integrate Ollama into your projects, then the API comes to the rescue. Essentially, you can run Ollama as a server and interact with it via HTTP requests. This opens up possibilities for integrations: from chatbots and assistants to text analysis directly within your applications.
Starting the server is very simple—just execute the command:
ollama serve
After this, Ollama starts listening on a local port (by default http://localhost:11434), and you can send requests to it.
For example, to generate text using the Mistral model, you just need to make a POST request to /api/generate:
POST http://localhost:11434/api/generate
{
  "model": "mistral",
  "prompt": "Write a short story about space"
}
In response, Ollama will return the text generated by the selected model. The coolest thing is that this approach is universal: you can use any programming language, be it Python, JavaScript, or C#, because communication happens via a standard HTTP request.
The basic capabilities of the API allow you to:
start and stop models;
generate text;
manage models (downloading, deleting, etc.).
It is the API that makes Ollama a truly flexible tool. Through it, you can not only talk to the model in the console but also create full-fledged applications, connect databases, or even build your own AI services.
We have gone through the entire journey together—from installing Ollama to working via the CLI and API. We've figured out what types of models exist, how they differ in format and size, what to look for when choosing, and how to properly configure the system to work as efficiently as possible.
To sum up, Ollama is an excellent tool for those who want to truly experience how an LLM works. You control the choice of model, its configuration, and integration methods. And most importantly, you have the freedom to experiment and build your own projects based on local artificial intelligence.
With this, we conclude our introduction, but we don't stop here: the world of LLMs is developing incredibly fast, with new models, formats, and approaches emerging. Therefore, the main thing is to keep learning and trying.
If you are interested in the latest news from the world of technology and IT, as well as practical advice, I invite you to my Telegram channel. There I share current news from the IT world and useful materials that will help you stay up to date with the latest developments.