6 min read · Emre Yurtbay

Hermes by Nous Research: vLLM, Tool Calling, and Self-Hosting in Practice

A practical guide to self-hosting Nous Research's Hermes models with vLLM, including a function-calling example and a lightweight Ollama alternative.

LLM · Self-Hosting · vLLM · Hermes · Nous Research · Function Calling · Open Weights · Ollama · Llama 3.1 · Inference

If you want to integrate language models into your own applications without tying yourself to an external API provider, open-weight models are practically the only option. The Hermes series from Nous Research is an interesting choice here: fine-tuned models with strong capabilities in tool calling, structured JSON output, and steering via the system prompt. This article shows how to run Hermes yourself with vLLM in a production-like setup, including a function-calling example and a lean alternative via Ollama.

What Hermes Is

Hermes is a family of open-weight models that Nous Research fine-tunes on top of existing foundation models; Nous Research does not pre-train them from scratch. The currently relevant generation is Hermes 3, which builds on Meta Llama 3.1 and is available in several sizes, including 8B, 70B, and 405B. Older variants such as Hermes 2 Pro are based on Llama 3 (8B) or Mistral (7B).

Hermes focuses on three things:

  • Function and tool calling: reliably generating structured calls that your application can execute.
  • Structured outputs: stable JSON generation, well suited for machine processing.
  • Steerability via system prompt: the model follows role and format specifications comparatively consistently.

The weights are publicly available in the HuggingFace namespace NousResearch, for example under NousResearch/Hermes-3-Llama-3.1-8B.

License

Since Hermes 3 is based on Llama 3.1, Meta's Llama 3.1 Community License applies. It permits commercial use but attaches conditions, including a threshold of 700 million monthly active users above which a separate license from Meta is required, as well as naming and attribution requirements. Review the Llama 3.1 license before production use. "Open weights" here means weights published under a license with conditions, not unrestricted open source.

Self-Hosting Options at a Glance

Path        Suitability                           Note
vLLM        Production, high throughput           OpenAI-compatible API, native tool parsing
Ollama      Local development, smaller hardware   GGUF, quantized, very easy to get started
llama.cpp   CPU/edge, maximum control             Basis for many GGUF setups
TGI         Alternative in the HF ecosystem       Text Generation Inference

The recommended main path for server operation is vLLM: it achieves high throughput through continuous batching and PagedAttention, and it exposes an interface compatible with the OpenAI API, so existing OpenAI clients only need a different base URL.

Roughly Estimating Hardware and VRAM

An exact VRAM figure depends on the weights plus the KV cache, which in turn grows with context length and batch size. As a rough order of magnitude in half precision (FP16/BF16): about 2 GB of VRAM per billion parameters for the weights alone, plus a reserve for the KV cache.

  • 8B: runs on a single consumer or workstation GPU; quantized (4-bit) even on significantly smaller cards.
  • 70B: requires multiple GPUs or aggressive quantization.
  • 405B: multi-GPU cluster with tensor parallelism, not for single machines.

Quantization (such as 8-bit or 4-bit) reduces memory requirements considerably but can cost accuracy. For initial tests, the 8B variant is the pragmatic entry point.
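The rule of thumb above can be turned into a quick back-of-the-envelope calculation. The following is a minimal sketch, not a sizing tool: the function names are made up for illustration, and the layer count, KV-head count, and head dimension are assumed values for the Llama 3.1 8B architecture that you should confirm against the config.json of your checkpoint. Activations, framework overhead, and fragmentation are ignored.

```python
def weight_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_param / 8


def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_value: int = 2) -> int:
    """KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


# Assumed Llama 3.1 8B shape; verify against the model's config.json.
PER_TOKEN = kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for size in (8, 70, 405):
        fp16 = weight_gb(size)
        int4 = weight_gb(size, bits_per_param=4)
        print(f"{size}B: ~{fp16:.0f} GB in FP16/BF16, ~{int4:.0f} GB at 4-bit")
    ctx = 8192
    print(f"KV cache for one {ctx}-token sequence: {PER_TOKEN * ctx / 2**30:.1f} GiB")
```

Under these assumptions, every concurrent sequence at full context length adds its own KV cache on top of the weights, which is why batch size and context length matter as much as parameter count.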

Architecture

+---------------------+      OpenAI-compatible       +---------------------+
|      Your app       | ---------------------------> |   vLLM inference    |
|  (backend/client)   |   POST /v1/chat/completions  |       server        |
|                     | <--------------------------- |                     |
+---------------------+        JSON / tool_calls     +---------------------+
        |    ^                                                  |
        |    |                                                  v
        |    |                                        +---------------------+
        |    |                                        |  GPU + Hermes       |
        |    |                                        |       model         |
        |    |                                        +---------------------+
        |    |
        |    |   Tool-calling loop
        v    |
+---------------------+
|    Execute tool     |   1. Model requests a tool (tool_calls)
|   (function, API,   |   2. App executes the function
|    DB, search ...)  |   3. Result returned as role=tool
+---------------------+   4. Model formulates the final answer

Starting vLLM

After installing vLLM (pip install vllm), start the server. For Hermes, pass the dedicated tool parser with --tool-call-parser hermes together with --enable-auto-tool-choice, so that vLLM correctly converts the tool calls generated by the model into the OpenAI format.

vllm serve NousResearch/Hermes-3-Llama-3.1-8B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192

The server then provides the familiar endpoints /chat/completions, /completions, and /models under http://localhost:8000/v1. You can find details on tool parsing in the vLLM documentation on tool calling.
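Before wiring up an application, it can help to confirm that the server is reachable and serves the expected model. The sketch below uses only the standard library and assumes the response of /v1/models follows the OpenAI list format (a "data" array of objects with an "id"); the helper names are hypothetical.

```python
import json
import urllib.request


def served_model_ids(models_payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def fetch_model_ids(base_url: str = "http://localhost:8000/v1",
                    timeout: float = 5.0) -> list[str]:
    """Query a running server; raises urllib.error.URLError if unreachable."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return served_model_ids(json.load(resp))


# Usage against a running server (not executed here):
#   print(fetch_model_ids())
```

The id returned here is the value to pass as model in subsequent chat requests.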

OpenAI-Compatible Call

Since the API is compatible, you can use the official openai client and simply set base_url and a placeholder key.

from openai import OpenAI

# vLLM ignores the key unless the server was started with one; any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    messages=[
        {"role": "system", "content": "You answer precisely in English."},
        {"role": "user", "content": "Name three advantages of self-hosting."},
    ],
    temperature=0.3,
)

print(resp.choices[0].message.content)

A simple test without the SDK works with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "NousResearch/Hermes-3-Llama-3.1-8B",
    "messages": [{"role": "user", "content": "Say a quick hello."}]
  }'

Function-Calling Example

For tool calling, you pass a tools array. If the model requests a call, the response delivers tool_calls instead of content. Your application executes the function and returns the result as a message with role: "tool".

import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Tool definition in the OpenAI function-calling format
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Returns the current temperature for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"}
            },
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "How warm is it in Hamburg right now?"}]

first = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

msg = first.choices[0].message

if msg.tool_calls:
    # This example handles only the first requested call
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)

    # Execute the function in your app (only simulated here)
    result = {"city": args["city"], "temp_c": 18}

    messages.append(msg)  # assistant message containing tool_calls
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": json.dumps(result),
    })

    final = client.chat.completions.create(
        model="NousResearch/Hermes-3-Llama-3.1-8B",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)
else:
    # The model answered directly without requesting a tool
    print(msg.content)

Each entry in tool_calls contains an id, the function name, and the arguments as a JSON string. This exact id must be referenced as tool_call_id in the role: "tool" message, so that the model can match the call with its result.
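A response may contain more than one tool call, and the model can request an unknown tool or emit malformed arguments. One way to handle this, sketched here under assumptions, is a small dispatcher that returns one role: "tool" message per call. The registry, the stand-in get_weather implementation, and the error payload format are hypothetical; the function only relies on each call exposing id, function.name, and function.arguments as described above.

```python
import json


def get_weather(city: str) -> dict:
    """Stand-in implementation; a real app would query a weather service."""
    return {"city": city, "temp_c": 18}


# Hypothetical registry mapping tool names to Python callables.
REGISTRY = {"get_weather": get_weather}


def build_tool_messages(tool_calls, registry=REGISTRY) -> list[dict]:
    """Run every requested tool and return one role="tool" message per call."""
    messages = []
    for call in tool_calls:
        func = registry.get(call.function.name)
        if func is None:
            result = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                # Arguments arrive as a JSON string and must be parsed first.
                result = func(**json.loads(call.function.arguments))
            except (json.JSONDecodeError, TypeError) as exc:
                result = {"error": f"invalid arguments: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,  # must match the id from tool_calls
            "content": json.dumps(result),
        })
    return messages
```

In the loop shown earlier, messages.extend(build_tool_messages(msg.tool_calls)) would take the place of the single hand-built tool message. Returning errors as tool results, rather than raising, gives the model a chance to correct itself in the next turn; whether that is desirable depends on your application.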

Lean Alternative: Ollama

For local experiments or smaller hardware, Ollama is the easiest path. It uses quantized GGUF models and comes with its own API. Available tags vary; check the library before pulling.

# Example: pull and start a Hermes variant
ollama run hermes3

# or explicitly choose a size
ollama run hermes3:8b

Ollama is well suited for trying things out, but for high parallel throughput in production, vLLM has a clear advantage.
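If you want to keep the same client code for both backends, one option is to treat the backend as configuration. The sketch below assumes that Ollama's OpenAI-compatible endpoint is available under /v1 on its default port 11434 and that the hermes3:8b tag has been pulled; both are assumptions about an unmodified local installation, and the BACKENDS table and helper name are hypothetical. The request is only constructed, not sent.

```python
import json
import urllib.request

# Assumed default local endpoints and model names; adjust to your setup.
BACKENDS = {
    "vllm": {
        "base_url": "http://localhost:8000/v1",
        "model": "NousResearch/Hermes-3-Llama-3.1-8B",
    },
    "ollama": {
        "base_url": "http://localhost:11434/v1",
        "model": "hermes3:8b",
    },
}


def build_chat_request(backend: str, user_text: str) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for the chosen backend."""
    cfg = BACKENDS[backend]
    body = {
        "model": cfg["model"],
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        f"{cfg['base_url']}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage against a running backend (not executed here):
#   with urllib.request.urlopen(build_chat_request("ollama", "Say hello.")) as r:
#       print(json.load(r)["choices"][0]["message"]["content"])
```

Keeping the backend behind a single lookup makes it straightforward to develop against Ollama locally and switch to vLLM for load tests.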

Practical Recommendation

Start with Hermes-3-Llama-3.1-8B on a single GPU under vLLM. First verify the tool-calling loop with a simulated function before connecting real systems. Pay attention to a sensible --max-model-len (the KV cache is the most common VRAM bottleneck) and validate structured outputs server-side via a schema, rather than blindly trusting the model. Review the Llama 3.1 license before going into production. Only scale up to 70B or quantization once the 8B variant measurably fails to meet your quality requirements.
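Server-side validation of model output can start very small. The following sketch checks a JSON string against the subset of JSON Schema used in the tools definition earlier (required keys and top-level property types). It is an illustration with a hypothetical helper name, not a full JSON Schema implementation; for production use, an established validation library is likely the better choice.

```python
import json

# JSON Schema type names mapped to Python types (subset only).
_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_arguments(raw: str, schema: dict) -> dict:
    """Parse a JSON string and check required keys and top-level types.

    Raises ValueError describing the first problem found.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key in schema.get("required", []):
        if key not in data:
            raise ValueError(f"missing required field: {key}")
    properties = schema.get("properties", {})
    for key, value in data.items():
        spec = properties.get(key)
        if spec is None:
            raise ValueError(f"unexpected field: {key}")
        expected = _TYPES.get(spec.get("type"))
        # bool is a subclass of int in Python, so reject it for numeric fields.
        stray_bool = isinstance(value, bool) and spec.get("type") in ("integer", "number")
        if expected and (stray_bool or not isinstance(value, expected)):
            raise ValueError(f"field {key} should be of type {spec['type']}")
    return data
```

Calling validate_arguments(call.function.arguments, tools[0]["function"]["parameters"]) before executing a tool would reject missing, mistyped, or unexpected fields instead of passing them on to real systems.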

If you are planning a Hermes or vLLM setup or want to optimize existing LLM infrastructure, feel free to write to info@yurtbay.dev. I can help with architecture, tool-calling integration, and deployment.
