AI
Ollama
What is Ollama?
Ollama is a free, open-source tool that allows users to run large language models (LLMs) on their own computers. LLMs are AI programs that can understand and generate human-like text, code, and perform other analytical tasks.
Install
Linux
$ curl -fsSL https://ollama.com/install.sh | sh
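Once the script finishes, the server should already be running; a quick sanity check (assuming the default port 11434) is to print the CLI version and query the version endpoint:
$ ollama -v
$ curl http://localhost:11434/api/version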
macOS
Download it here
Windows
Download it here
NVIDIA AGX Thor Developer Kit
$ sudo su
$ export PATH=/usr/local/cuda-13.0/bin:$PATH
$ export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
$ cd /opt
# Fetch the latest Go version dynamically (requires jq)
$ GO_VERSION=$(curl -s "https://go.dev/dl/?mode=json" | jq -r '.[0].version') # e.g. "go1.25.1"
$ GO_TARBALL="${GO_VERSION}.linux-arm64.tar.gz"
# Download Go
$ wget "https://go.dev/dl/${GO_TARBALL}"
# remove any old version
$ rm -rf /usr/local/go
# Extract and install
$ tar -xzf "${GO_TARBALL}"
$ mv go /usr/local/
$ export PATH=$PATH:/usr/local/go/bin
$ cd /opt && git clone https://github.com/ollama/ollama
$ cd ollama && cmake -DCMAKE_CUDA_ARCHITECTURES=110 -B build && cmake --build build
$ cd /opt/ollama
$ go run . serve
Test with DeepSeek R1 8B model:
$ export PATH=$PATH:/usr/local/go/bin
$ go run . run --verbose deepseek-r1:8b
Run this command to monitor GPU performance:
$ watch -n 1 nvidia-smi
# or
$ tegrastats
Ollama Service
Create a systemd unit file at /etc/systemd/system/ollama.service with the following content:
[Unit]
Description=Ollama Server Service
After=network.target
[Service]
# Run as a non-root user for safety (change 'ubuntu' to your username)
User=ubuntu
Group=ubuntu
# Set working directory
WorkingDirectory=/opt/ollama
# Environment variables
Environment="OLLAMA_HOST=0.0.0.0"
Environment="PATH=/usr/local/go/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
# Command to start Ollama
ExecStart=/usr/local/go/bin/go run . serve
# Restart policy
Restart=always
RestartSec=5
# Output logs to journalctl
StandardOutput=journal
StandardError=journal
[Install]
WantedBy=multi-user.target
$ sudo systemctl daemon-reload
$ sudo systemctl enable ollama.service
$ sudo systemctl start ollama.service
$ sudo systemctl restart ollama.service
$ sudo systemctl stop ollama.service
$ sudo journalctl -u ollama.service -f
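Because the unit sets OLLAMA_HOST=0.0.0.0, the API listens on all interfaces. A quick check from another machine on the network (replace <device-ip> with the Thor's address; it is only a placeholder here):
$ curl http://<device-ip>:11434/api/version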
Linux Manual Install
$ curl -L https://ollama.com/download/ollama-linux-amd64.tgz \
-o ollama-linux-amd64.tgz
Extract the contents of ollama-linux-amd64.tgz into /usr:
$ sudo tar -C /usr -xzf ollama-linux-amd64.tgz
Run Ollama for tests:
$ ollama serve
Open another terminal and verify that Ollama is running:
$ ollama -v
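With the server confirmed, pulling and running a small model makes a convenient smoke test (llama3.2:1b is only an example; any model from the Ollama library works):
$ ollama pull llama3.2:1b
$ ollama run llama3.2:1b "Say hello in one sentence."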
Make Ollama a startup service?
The latest version of Ollama will set up the service user and the systemd service for you during installation.
Install CUDA
Check out the instructions here
Customizing
$ sudo systemctl edit ollama
Alternatively, create an override file manually in /etc/systemd/system/ollama.service.d/override.conf:
[Service]
Environment="OLLAMA_DEBUG=1"
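After saving the override, reload systemd and restart the service so the new environment takes effect:
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama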
Updating
Update Ollama by running the install script again:
$ curl -fsSL https://ollama.com/install.sh | sh
Or by re-downloading Ollama:
$ curl -L https://ollama.com/download/ollama-linux-amd64.tgz \
-o ollama-linux-amd64.tgz
$ sudo tar -C /usr -xzf ollama-linux-amd64.tgz
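Either way, confirm the installed version afterwards:
$ ollama -v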
View Logs
$ journalctl -e -u ollama
Uninstall Manual Installation
Remove system service:
$ sudo systemctl stop ollama
$ sudo systemctl disable ollama
$ sudo rm /etc/systemd/system/ollama.service
Remove the ollama binary from your bin directory (either /usr/local/bin, /usr/bin, or /bin):
$ sudo rm $(which ollama)
Remove the downloaded models and Ollama service user and group:
$ sudo rm -r /usr/share/ollama
$ sudo userdel ollama
$ sudo groupdel ollama
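Depending on the release, the tarball may also unpack bundled libraries under /usr/lib/ollama; if that directory exists, remove it as well:
$ sudo rm -rf /usr/lib/ollama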
Ollama Commands
List models:
$ ollama list
Remove a model:
$ ollama rm <model name>
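A few other frequently used subcommands (ollama stop requires a reasonably recent release):
$ ollama pull <model name>
$ ollama show <model name>
$ ollama ps
$ ollama stop <model name>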
Expose Ollama
macOS
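When Ollama runs as the macOS desktop app, environment variables are typically set with launchctl and the app restarted afterwards:
$ launchctl setenv OLLAMA_HOST "0.0.0.0"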
Linux
- Edit the systemd service by calling sudo systemctl edit ollama.service and add the following:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
- Reload systemd and restart Ollama with the following commands:
$ sudo systemctl daemon-reload
$ sudo systemctl restart ollama
Windows
On Windows, Ollama inherits your user and system environment variables.
- First, quit Ollama by clicking on it in the taskbar.
- Start the Settings (Windows 11) or Control Panel (Windows 10) application and search for environment variables.
- Click on Edit environment variables for your account.
- Edit or create a new variable for your user account for OLLAMA_HOST, OLLAMA_MODELS, etc.
- Click OK/Apply to save.
- Start the Ollama application from the Windows Start menu.
Nginx
server {
    listen 80;
    server_name example.com; # Replace with your domain or IP

    location / {
        proxy_pass http://localhost:11434;
        proxy_set_header Host localhost:11434;
    }
}
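After placing the server block in your nginx configuration (for example under /etc/nginx/conf.d/, an assumed path), validate and reload, then test through the proxy:
$ sudo nginx -t
$ sudo systemctl reload nginx
$ curl http://example.com/api/tags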
Ngrok
Ollama can be exposed through a range of tunneling tools. Run the following command for test purposes:
$ ngrok http 11434 --host-header="localhost:11434"
Alternatively, edit the Ngrok config file manually in ~/.config/ngrok/ngrok.yml
or run the following command:
$ ngrok config edit
Grab your Ngrok token and free domain name and replace them in the config file:
version: 3
agent:
  authtoken: 2n*******************
tunnels:
  ollama:
    proto: http
    addr: 11434
    domain: example.ngrok.app
    request_header:
      add: ["Host: localhost:11434"]
Start your tunnel:
$ ngrok start ollama
# or
$ ngrok start --all
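Once the tunnel is up, the API should answer on the public domain (example.ngrok.app is the placeholder from the config above):
$ curl https://example.ngrok.app/api/tags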
Run Ollama
$ ollama run llama3.2:3b
# or
$ ollama run mistral
If you want to monitor Ollama's generation performance:
$ ollama run llama3.2 --verbose
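A prompt can also be passed directly for a non-interactive, one-shot generation:
$ ollama run llama3.2:3b "Summarize what Ollama does in one sentence."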
API
Generate Embeddings
$ curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Why is the sky blue?"
}'
Multiple Text Inputs:
$ curl http://localhost:11434/api/embed -d '{
"model": "all-minilm",
"input": ["Why is the sky blue?", "Why is the grass green?"]
}'
Tags
$ curl http://localhost:11434/api/tags
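Text generation works the same way: /api/generate takes a model and a prompt, and "stream": false returns a single JSON object instead of a stream (the model name is just an example):
$ curl http://localhost:11434/api/generate -d '{
"model": "llama3.2:3b",
"prompt": "Why is the sky blue?",
"stream": false
}'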
How to troubleshoot issues
$ cat ~/.ollama/logs/server.log
# or
$ journalctl -u ollama --no-pager --follow --pager-end
FAQ
How can I tell if my model was loaded onto the GPU?
Use the ollama ps command to see which models are currently loaded into memory.
| NAME | ID | SIZE | PROCESSOR | UNTIL |
|---|---|---|---|---|
| llama3:70b | bcfb190ca3a7 | 42 GB | 100% GPU | 4 minutes from now |
How do I manage the maximum number of requests the Ollama server can queue?
If too many requests are sent to the server, it will respond with a 503 error indicating the server is overloaded. You can adjust how many requests may be queued by setting OLLAMA_MAX_QUEUE.
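For a systemd install, OLLAMA_MAX_QUEUE can be set like any other variable through a service override (the documented default is 512; the value below is only an example):
[Service]
Environment="OLLAMA_MAX_QUEUE=1024"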
Where are models stored?
- macOS: ~/.ollama/models
- Linux: /usr/share/ollama/.ollama/models
- Windows: C:\Users\%username%\.ollama\models
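The storage location can be redirected with the OLLAMA_MODELS environment variable; the path below is just an example, and for a systemd install it belongs in a service override rather than a shell export:
$ export OLLAMA_MODELS=/data/ollama/models
$ ollama serve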