
Local LLMs are getting easier

There is increasing interest in using smaller large language models (LLMs) hosted locally instead of accessed from cloud-based vendors such as OpenAI. My clients have been interested in these either for cost reasons or for data protection reasons (since no data goes to OpenAI or other vendors).

Although this has been possible for a while from Python using (mainly) the excellent Hugging Face libraries, new options have become available that make this easier and more flexible, especially from other languages such as Go and Rust. Here are observations and tips on a few alternatives that I’ve been trying.

Ollama

My favourite has been Ollama, a very clean and easy-to-use open-source tool (written in Go!) that downloads a select number of LLMs, then runs them, providing an input line for executing prompts, as well as exposing an API that is similar to the one we are used to from OpenAI.

The tool can be downloaded from the web site, and is easy to install on macOS and Linux. Then, just start it by typing ollama serve in a separate window.

You first need to download one of the 50 or so supported models, listed here. These include Llama 3.1 and 3.0, several variants of Mistral, phi, and others. For example:

ollama pull llama3.1

Type ollama list to see a list of models that have been downloaded. There does not seem to be a command to list all available models; see the web page linked above for that.

As soon as a model is downloaded, run it with ollama run llama3.1 and it will start up with an interactive input line where you can enter prompts. Type /show info at that input line (or run ollama show llama3.1 from the shell) to show information about the model, such as the number of parameters and context length.

It also exposes a REST API on port 11434 that is similar to OpenAI's, with endpoints /api/generate and /api/chat, making this an easy option for calling the LLM from any language that can make an HTTP request. For example, from Go:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// Structure for a generate request
type Generate struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
}

// Structure of one token returned per line
type Token struct {
	Model     string `json:"model"`
	CreatedAt string `json:"created_at"`
	Response  string `json:"response"`
	Done      bool   `json:"done"`
}

func main() {

	// Parameters for the query
	prompt := "What is time?"
	model := "llama3" // must be a model you have already pulled
	url := "http://localhost:11434/api/generate"

	// Formulate a request to generate response to prompt, as JSON
	msg := Generate{model, prompt}
	b, err := json.Marshal(msg)
	if err != nil {
		fmt.Println(err.Error())
		return
	}

	// Needs to be an io.Reader for the Post request
	data := bytes.NewReader(b)

	// Make a POST request to the API
	response, err := http.Post(url, "application/json", data)
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	// Release the connection once the body has been read
	defer response.Body.Close()

	// Retrieve response (io.ReadAll replaces the deprecated ioutil.ReadAll)
	responseData, err := io.ReadAll(response.Body)
	if err != nil {
		fmt.Println(err.Error())
		return
	}

	// Extract and show individual tokens, one per line
	lines := strings.Split(string(responseData), "\n")
	tokens := []string{}
	for _, l := range lines {

		// End of input
		if len(l) == 0 {
			break
		}

		// Parse JSON
		tkn := Token{}
		err := json.Unmarshal([]byte(l), &tkn)
		if err != nil {
			fmt.Println(err.Error())
			return
		}

		// Stop after final token
		if tkn.Done {
			break
		}

		// Add to list of tokens
		tokens = append(tokens, tkn.Response)
	}

	// Show result as one string (tokens already include their own spacing)
	fmt.Println(strings.Join(tokens, ""))
}

Running this returns the following rather fanciful answer:

What a profound and complex question!

Time is a fundamental concept in our experience of the universe, but its nature has puzzled philosophers, scientists, and scholars for centuries. Here's an attempt to provide a comprehensive answer:

**Philosophical Perspectives:**

1. **Classical view:** Time is an objective feature of the universe, a linear progression from past to present to future. This perspective assumes time is a fixed, one-way dimension that governs our experience.
2. **Relational view:** Time is relative and dependent on the observer's frame of reference. According to this perspective, time is not absolute, but rather an emergent property of physical laws and human perception.
3. **Subjective view:** Time is a subjective experience, shaped by individual experiences, emotions, and memories.

**Scientific Theories:**

1. **Newtonian Mechanics:** Time is an absolute concept, independent of the observer's frame of reference. According to Newton's laws, time flows at the same rate everywhere in the universe.
2. **Relativity Theory (Einstein):** Time and space are intertwined as a single entity called spacetime. The laws of physics are the same for all observers, regardless of their relative motion or position. Time can appear to pass differently depending on the observer's frame of reference (e.g., time dilation).
3. **Quantum Mechanics:** Time is not a fixed dimension but rather an emergent property of quantum systems. Quantum phenomena, such as wave functions and entanglement, operate outside our classical understanding of time.

**Paradigm Shifts:**

1. **Kaluza-Klein Theory:** Higher-dimensional theories propose that our universe has more than the four dimensions we experience (three spatial and one temporal). This could imply that time is not an absolute concept.
2. **Fractal Time:** The idea that time is fractal, with repeating patterns at different scales, challenges our classical understanding of time as a linear progression.

**Philosophical Implications:**

1. **Free will vs. determinism:** If time is relative or subjective, does this imply free will, or are events predetermined?
2. **The nature of change:** Does the relativity of time imply that change is an illusion, or that reality is fundamentally dynamic?

**Conclusion:**

Time is a multifaceted concept that has been debated and explored across various disciplines. While our understanding of time has evolved significantly, the fundamental nature of time remains a subject of ongoing research and philosophical inquiry.

What's your take on time? Do you have any questions or perspectives to share?

Ollama also has options for importing GGUF files, creating models with a built-in system prompt, and more. See the GitHub page for details.
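The chat endpoint mentioned earlier can be called in much the same way as generate. Here is a rough sketch that sends a single user message and asks for the reply as one JSON object rather than a stream of tokens. The helper names (buildChatRequest, parseChatResponse) are invented for this example, the model name is assumed to be one you have already pulled, and only the response fields the sketch needs are declared.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// Message is one turn of a conversation.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// ChatRequest is the body sent to the chat endpoint.
type ChatRequest struct {
	Model    string    `json:"model"`
	Messages []Message `json:"messages"`
	Stream   bool      `json:"stream"`
}

// ChatResponse holds the fields of the reply that this sketch uses.
type ChatResponse struct {
	Message Message `json:"message"`
	Done    bool    `json:"done"`
}

// buildChatRequest (hypothetical helper) encodes a single-turn,
// non-streaming chat request.
func buildChatRequest(model, prompt string) ([]byte, error) {
	req := ChatRequest{
		Model:    model,
		Messages: []Message{{Role: "user", Content: prompt}},
		Stream:   false,
	}
	return json.Marshal(req)
}

// parseChatResponse (hypothetical helper) extracts the assistant's text
// from a reply body.
func parseChatResponse(body []byte) (string, error) {
	var resp ChatResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	return resp.Message.Content, nil
}

func main() {
	url := "http://localhost:11434/api/chat"

	body, err := buildChatRequest("llama3.1", "What is time? Answer in one sentence.")
	if err != nil {
		fmt.Println(err.Error())
		return
	}

	response, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	defer response.Body.Close()

	data, err := io.ReadAll(response.Body)
	if err != nil {
		fmt.Println(err.Error())
		return
	}

	answer, err := parseChatResponse(data)
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	fmt.Println(answer)
}
```

To continue a conversation, the earlier user and assistant messages would be appended to the Messages slice before the next request, since the server does not keep the history for you.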

LlamaFile

Another good alternative is LlamaFile, which provides an executable that contains the model inside it. To run this model, just download one of the models from the web site, make it executable, and run it directly.

This option exposes a web interface for exploring chats (at http://localhost:8080), as well as an API compatible with the OpenAI one.
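Because the API follows the OpenAI conventions, calling it from Go looks much like calling Ollama. The sketch below is written under a few assumptions that are not from the text above: that the chat endpoint lives at the usual OpenAI path /v1/chat/completions, that the reply uses the usual choices list, and that the model name is only a placeholder because the model is already baked into the executable. The firstChoice helper is a made-up name for this example.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net/http"
)

// ChatMessage is one turn of a conversation in the OpenAI format.
type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// CompletionRequest is the body sent to the chat completions endpoint.
type CompletionRequest struct {
	Model    string        `json:"model"`
	Messages []ChatMessage `json:"messages"`
}

// CompletionResponse declares only the fields this sketch reads.
type CompletionResponse struct {
	Choices []struct {
		Message ChatMessage `json:"message"`
	} `json:"choices"`
}

// firstChoice (hypothetical helper) returns the text of the first choice
// in an OpenAI-style reply, or an error if there is none.
func firstChoice(body []byte) (string, error) {
	var resp CompletionResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	if len(resp.Choices) == 0 {
		return "", errors.New("response contained no choices")
	}
	return resp.Choices[0].Message.Content, nil
}

func main() {
	// Assumed path; the host and port are the ones given above
	url := "http://localhost:8080/v1/chat/completions"

	req := CompletionRequest{
		Model: "local-model", // placeholder name, an assumption of this sketch
		Messages: []ChatMessage{
			{Role: "user", Content: "What is time? Answer in one sentence."},
		},
	}
	body, err := json.Marshal(req)
	if err != nil {
		fmt.Println(err.Error())
		return
	}

	response, err := http.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	defer response.Body.Close()

	data, err := io.ReadAll(response.Body)
	if err != nil {
		fmt.Println(err.Error())
		return
	}

	answer, err := firstChoice(data)
	if err != nil {
		fmt.Println(err.Error())
		return
	}
	fmt.Println(answer)
}
```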

This is an attractive way to explore local LLMs, but I have since found Ollama easier to use and it offers a broader range of models.

LlamaCPP

Most of the adaptations described above are derived from Llama.cpp, an amazing C++ program that loads and runs Llama and some other transformer models inside a single program. It exposes both a web interface and an API. A large selection of models has been ported to this option.

It is more fiddly than Ollama, because it requires you to separately obtain the LLM in GGUF format, and specify this on the command line when running it. Most GGUF models are available on Hugging Face, but it’s still an extra step with some hassle.

I’ll create a separate post describing how to run LlamaCPP, but you can probably figure it out from the GitHub page linked above.

Candle

Another interesting and ambitious option, implemented in Rust, is Candle, which supports about 20 models and, since it is backed by Hugging Face, is well documented and maintained.

I plan on creating a separate post for this as I explore it further.