
What is Ollama?

Ollama is an open-source platform that runs open LLMs on your local machine (or a server). It acts as a bridge between an open LLM and your machine: it not only runs the model but also provides an API layer on top of it, so other applications and services can use it.
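Once the Ollama server is running, it exposes a simple REST API on localhost (port 11434 by default), so any application can send it prompts over HTTP. A minimal sketch, assuming the server is up and the model named below has already been pulled:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

With "stream": false the response comes back as a single JSON object instead of a stream of tokens.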

Tweaking GenAI LLMs Locally Using Ollama

This guide provides essential commands and environment tweaks for optimizing Ollama when running large LLMs locally.

Prerequisites

If you don't have Ollama installed, go to https://ollama.com and download the installation package for your operating system.
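On Linux you can also install it straight from the terminal; a quick sketch using the official install script published on ollama.com:

curl -fsSL https://ollama.com/install.sh | sh
ollama --version   # confirm the installation worked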



Running a Model

Find your desired model on https://ollama.com. In this example, we are going to use the popular DeepSeek-R1 model.



To run our chosen model with Ollama, open your terminal and run the following command:

ollama run deepseek-r1:14b
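The first run downloads the model weights and then drops you into an interactive chat session. You can also pass a one-off prompt on the command line or check which models are installed; a short sketch, assuming the download has finished:

ollama run deepseek-r1:14b "Explain flash attention in two sentences."
ollama list   # lists the models available locally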

Tweaking Model Parameters


Ollama's default context size is 2048 tokens. To change this, you'll need to create a file named 'Modelfile' with the following content:


FROM deepseek-r1:14b
 
PARAMETER num_ctx 32768
PARAMETER num_predict -1

In this file, num_ctx sets the context window to 32,768 tokens, and num_predict is set to -1, which removes the fixed limit on output tokens so the model can use the remaining context for its response.

After creating the file, run the following command:

ollama create deepseek-r1-14b-32k-context -f ./Modelfile
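To confirm the new parameters were picked up, recent Ollama versions let you inspect a model's details, including its parameters; a quick check (the model name matches the one created above):

ollama show deepseek-r1-14b-32k-context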

Now you can use the model's full 32K context without truncation by running:

ollama run deepseek-r1-14b-32k-context
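Alternatively, if your application talks to Ollama over HTTP, the same parameters can be passed per request through the API's options field instead of being baked into a Modelfile; a sketch using the original model:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:14b",
  "prompt": "Summarize the history of flash attention.",
  "stream": false,
  "options": { "num_ctx": 32768, "num_predict": -1 }
}'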

Performance Optimization


Set the following environment variables (before starting the Ollama server) to improve performance and maximize token generation speed:

export OLLAMA_FLASH_ATTENTION=true   # enable flash attention in the server
export OLLAMA_KV_CACHE_TYPE=f16      # precision used for the key/value cache
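Note that these exports only affect the shell they are run in; the Ollama server process has to see them when it starts. On a Linux install that runs Ollama as a systemd service, a common approach (a sketch, assuming the default ollama.service unit) is to add them to the service environment and restart it:

sudo systemctl edit ollama.service
# in the editor that opens, add:
# [Service]
# Environment="OLLAMA_FLASH_ATTENTION=true"
# Environment="OLLAMA_KV_CACHE_TYPE=f16"
sudo systemctl restart ollama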

Available OLLAMA_KV_CACHE_TYPE Values

  • q8_0 (8-bit quantized cache; roughly half the memory of f16 with minimal quality loss)
  • q4_0 (4-bit quantized cache; roughly a quarter of the memory of f16, with a more noticeable quality trade-off)
  • f16 (16-bit floating point cache; the default and highest fidelity, but the largest memory footprint)

These settings can help optimize memory usage and speed when running large models locally.
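To get a feel for why the cache type matters, the KV cache size can be estimated as 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. A back-of-the-envelope sketch in the shell, with illustrative layer and head counts that are assumptions rather than values from the model card:

# assumed: 48 layers, 8 KV heads, head dim 128, 32768-token context
echo $(( 2 * 48 * 8 * 128 * 32768 * 2 ))   # f16  (~2 bytes/element) -> ~6.4 GB
echo $(( 2 * 48 * 8 * 128 * 32768 * 1 ))   # q8_0 (~1 byte/element)  -> ~3.2 GB

Halving the cache precision roughly halves this footprint, which can be the difference between a long context fitting in VRAM and spilling over to system memory.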