Ollama

Ollama is an open-source platform for running large language models. It can also launch coding agents and assistants, and it integrates with various IDEs and editors.

Ollama server through a Batch Job

The following example Slurm script (myscript.sh) starts an Ollama server on the Falcon cluster using one NVIDIA L40S GPU; the examples on this page use it to run the model gemma4:12b. The job is limited to 1 day, and the server listens on port 11434. Adjust the model, number of CPUs, number of GPUs, and account name to suit your work.

#!/bin/bash
#SBATCH --account=<account_name>
#SBATCH --partition=l40s_normal_q
#SBATCH --time=1-0:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=12
#SBATCH --gpus=1

module load ollama

ollama serve

Submit the Slurm script with sbatch myscript.sh and monitor the status of the job with squeue. After the job starts, allow a few minutes for the server to spin up. When the endpoint is ready, the log file will contain a line like:

time=2026-06-10T11:22:15.875-04:00 level=INFO source=routes.go:1981 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
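Instead of checking the log by eye, a short script can wait for that line. The following is a minimal sketch, assuming the readiness message has the form shown above and that you pass in the path of your job's log file. The helper names find_listen_address and wait_until_listening are illustrative and are not part of Ollama.

```python
import re
import time
from pathlib import Path

# Matches the address in a line such as: msg="Listening on 127.0.0.1:11434 (version ...)"
LISTEN_RE = re.compile(r"Listening on (\S+):(\d+)")


def find_listen_address(log_text):
    """Return (host, port) from the first 'Listening on' line, or None if absent."""
    match = LISTEN_RE.search(log_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def wait_until_listening(log_path, timeout_s=600, poll_s=5):
    """Poll the log file until the server reports its address or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    path = Path(log_path)
    while time.monotonic() < deadline:
        if path.exists():
            address = find_listen_address(path.read_text(errors="replace"))
            if address is not None:
                return address
        time.sleep(poll_s)
    return None
```

Calling wait_until_listening with the path to your job's log file returns the host and port once the line appears, or None if the timeout expires first.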

Once the Ollama server has started, you can run a model in either of the following ways:

Ollama CLI
Ollama API (cURL, Python, OpenAI)

Ollama CLI

To use Ollama’s CLI, connect directly to the compute node allocated to your batch job. In this example, the Ollama server is running on fal036. To connect to fal036, open an SSH connection from an ARC login node:

user@falcon2:~$ ssh fal036
user@fal036:~$ 

Once on the compute node, you can start a chat with the following commands:

user@fal036:~$ module load ollama
user@fal036:~$ ollama run gemma4:12b
>>> Send a message (/? for help)

Ollama API (cURL)

Once the Ollama server is running, its API is automatically available and can be accessed via curl to integrate into your applications and various workflows.

To connect software running on your computer to the LLM running on the compute node, or to query the LLM from elsewhere on the Falcon cluster, you must set up SSH port forwarding. This redirects network traffic from your computer (or from the Falcon cluster) to the compute node via the login node. In this example, the node is fal036 and the port is 11434.

If you are on the Falcon cluster, use the following SSH port forwarding command:

ssh -N -L 11434:localhost:11434 fal036

If you receive a warning like the following:

bind [127.0.0.1]:11434: Address already in use
channel_setup_fwd_listener_tcpip: cannot listen to port: 11434
Could not request local forwarding.

Then you must stop the process that is already using the port. Run loginusage $USER to find the PID of the ssh process, then stop it with kill -9 PID. Once the process has stopped, you can set up SSH port forwarding to another node.
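Before starting the tunnel, you can check whether the local port is already taken. The sketch below is a rough check using Python's socket module, assuming the tunnel binds to 127.0.0.1. The function names port_is_free and pick_free_port are illustrative.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can bind to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "Address already in use": something else holds the port.
            return False
        return True


def pick_free_port(host="127.0.0.1"):
    """Ask the operating system for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]
```

If port 11434 is held by a process you would rather not stop, one option is to forward a different local port, for example ssh -N -L <free_port>:localhost:11434 fal036, and point your clients at that port instead.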

If you are on your local computer, use the following SSH port forwarding command, which jumps through the falcon2 login node:

ssh -J user@falcon2.arc.vt.edu -L 11434:localhost:11434 fal036

If you have not downloaded or run a particular Ollama model before, you must open an SSH connection to the compute node and use ollama to pull that model into your /home/user/.ollama directory. The previous SSH port forwarding command might automatically open a terminal connected to the compute node.

user@falcon2:~$ ssh fal036
user@fal036:~$ module load ollama
user@fal036:~$ ollama pull gemma4:12b
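After pulling, one way to confirm that the model is available is to ask the server for its list of local models through the /api/tags endpoint. The following is a minimal sketch using only the Python standard library, assuming the server is reachable at http://localhost:11434 (on the compute node itself or through port forwarding). The function names are illustrative.

```python
import json
import urllib.request


def model_names(tags_json):
    """Extract model names from the JSON body returned by GET /api/tags."""
    payload = json.loads(tags_json)
    return [entry["name"] for entry in payload.get("models", [])]


def is_downloaded(model, tags_json):
    """Return True if the given model name appears in the /api/tags listing."""
    return model in model_names(tags_json)


def fetch_tags(base_url="http://localhost:11434", timeout_s=10):
    """Fetch the raw /api/tags response text (requires a running server)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")
```

With a server running, is_downloaded("gemma4:12b", fetch_tags()) reports whether the model still needs to be pulled.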

Now you can use Ollama’s API to submit queries via localhost, either on your computer or from anywhere on the Falcon cluster.

curl http://localhost:11434/api/chat \
  -d '{
    "model": "gemma4:12b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
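By default, /api/chat streams its reply as one JSON object per line, so the output of the curl command above arrives in fragments. The sketch below sends the same request with Python's standard library and joins the fragments into a single string. It assumes the server is reachable at http://localhost:11434, and the function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(model, prompt, base_url="http://localhost:11434"):
    """Build the same POST request that the curl command sends."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def join_stream(lines):
    """Concatenate the message fragments from newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk is marked with "done": true
    return "".join(parts)


def chat_once(model, prompt, base_url="http://localhost:11434"):
    """Send one prompt and return the assembled reply (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(model, prompt, base_url)) as resp:
        return join_stream(resp)
```

If you prefer a single JSON object instead of a stream, add "stream": false to the request body.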

Python

Ollama also provides a Python library. An example Python script is below:

from ollama import chat

response = chat(
    model='gemma4:12b',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.message.content)

This Python script must be run on the compute node, not the login node, in an environment where the ollama Python package is installed. ARC provides a central Python module, or you can use a virtual environment.

OpenAI API

Ollama also exposes an OpenAI-compatible API. The following chat completion example uses the OpenAI Python client. Run it from your computer after you have set up SSH port forwarding:

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',  # required but ignored
)

chat_completion = client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'Say this is a test',
        }
    ],
    model='gemma4:12b',
)
print(chat_completion.choices[0].message.content)

Read more examples of using Ollama with the OpenAI API.