vLLM

vLLM is a fast and easy-to-use library for LLM inference and serving. Users can deploy their own vLLM instances with custom Slurm scripts. Models downloaded from Hugging Face are available at /common/data/models/. If you want us to download and deploy additional models, please submit a request via 4help. Users are required to accept the terms and conditions of the models using their Hugging Face account.
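
To see which models are already available, you can list the shared model directory from a login or compute node:

ls /common/data/models/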

Tip

When serving models, use the path to the model in /common rather than the model name, e.g. vllm serve /common/data/models/openai--gpt-oss-120b rather than vllm serve openai/gpt-oss-120b. This saves space in your home directory's .cache/huggingface and prevents you from running out of storage quota.

Running your own LLM using vLLM

The following example Slurm script launches a vLLM instance running the model openai/gpt-oss-120b on 2 NVIDIA L40S GPUs on the Falcon cluster. The job duration is limited to 1 day, the model listens on port 8000, and the API key for the OpenAI-compatible endpoint is a3b91d38-6c74-4e56-b89f-3b2cfd728d1a. Adjust the settings to select the model, number of GPUs, context length, port, and API key you need.

#!/bin/bash
#SBATCH --account=<account_name>
#SBATCH --partition=l40s_normal_q
#SBATCH --time=1-0:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:l40s:2
#SBATCH --output=gpt-oss-120b.log

module load vLLM

vllm serve /common/data/models/openai--gpt-oss-120b \
--served-model-name gpt-oss-120b \
--tensor-parallel-size 2 \
--max-model-len 32768 \
--max-seq-len-to-capture 32768 \
--swap-space 16 \
--port 8000 \
--api-key a3b91d38-6c74-4e56-b89f-3b2cfd728d1a

Run the Slurm script with sbatch myscript.sh and monitor the job's status with squeue. Once the job starts, allow a few minutes for the model to spin up. When the endpoint is ready, the log file will show:

INFO:     Started server process
INFO:     Waiting for application startup.
INFO:     Application startup complete.
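
A typical submit-and-monitor workflow might look like the following sketch (the script and log file names match the example above):

sbatch myscript.sh
squeue -u $USER            # note the compute node in the NODELIST column
tail -f gpt-oss-120b.log   # wait for "Application startup complete."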

At this point, you can use the OpenAI API to submit queries. Please note the compute node where the instance is running (see squeue). In this example, the node is fal036, the port is 8000, and the API key is a3b91d38-6c74-4e56-b89f-3b2cfd728d1a. Therefore, the following query will work from anywhere within the Falcon cluster.

curl -v http://fal036:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer a3b91d38-6c74-4e56-b89f-3b2cfd728d1a" \
  -d '{
    "model": "gpt-oss-120b",
    "prompt": "Why is the sky blue?",
    "max_tokens": 200,
    "temperature": 1,
    "top_p": 0.9,
    "seed": 10
  }'

If you wish to connect software running on your computer to the LLM running on a compute node of the cluster, you must use SSH port forwarding to redirect the network traffic from your computer to the compute node via the login node. For example:

ssh -N -L 8000:fal036:8000 user@falcon2.arc.vt.edu
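
With the tunnel running, you can quickly check that the endpoint is reachable by listing the served models (the /v1/models route is part of the OpenAI-compatible API exposed by vLLM):

curl -H "Authorization: Bearer a3b91d38-6c74-4e56-b89f-3b2cfd728d1a" http://localhost:8000/v1/models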

Now you can use the OpenAI API to submit queries via localhost on your computer.

curl -v http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer a3b91d38-6c74-4e56-b89f-3b2cfd728d1a" \
  -d '{
    "model": "gpt-oss-120b",
    "prompt": "Why is the sky blue?",
    "max_tokens": 200,
    "temperature": 1,
    "top_p": 0.9,
    "seed": 10
  }'

OpenAI chat completion example using the openai Python package. Run this from your computer while the SSH tunnel is active:

import argparse

from openai import OpenAI

# Modify OpenAI's API key and API base to use the server.
openai_api_key = "a3b91d38-6c74-4e56-b89f-3b2cfd728d1a"
openai_api_base = "http://localhost:8000/v1"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Virginia Tech known for?"},
]


def parse_args():
    parser = argparse.ArgumentParser(description="Client for API server")
    parser.add_argument(
        "--stream", action="store_true", help="Enable streaming response"
    )
    return parser.parse_args()


def main(args):
    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    models = client.models.list()
    model = models.data[0].id

    # Chat Completion API
    chat_completion = client.chat.completions.create(
        messages=messages,
        model=model,
        stream=args.stream,
    )

    print("-" * 50)
    print("Chat completion results:")
    if args.stream:
        for c in chat_completion:
            print(c)
    else:
        print(chat_completion)
    print("-" * 50)


if __name__ == "__main__":
    args = parse_args()
    main(args)
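
To try the script, save it to a file (e.g. client.py, a name used here only for illustration), make sure the openai Python package is installed (pip install openai), and run it with or without streaming:

python client.py
python client.py --stream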

Read more examples of OpenAI API usage (multimodal, reasoning, embeddings, tools, etc.).