drillsteps5 4 days ago

I concur with the LocalLLaMA subreddit recommendation. Not in terms of choosing "the best model", but for answering questions, finding guides, the latest news and gossip, the names of the tools, the various models and how they stack up against each other, etc.

There's no one "best" model; you just try a few, play with the parameters, and see which one fits your needs best.

Since you're on HN, I'd recommend skipping Ollama and LM Studio. They might restrict access to the latest models, and you typically only get to choose from the ones they've tested with. Besides, what kind of fun is it when you don't get to peek under the hood?

llama.cpp can do a lot by itself, and it can run most recently released models (when changes are needed, they adjust literally within a few days). You can get models from Hugging Face, obviously. I prefer the GGUF format; it saves me some memory because you can use lower quantization (I find 6-bit quants satisfactory for most models).
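If it helps, here's roughly what that looks like; the repo and file names below are just placeholders, substitute whatever GGUF you actually want:

    # Grab a ~6-bit (Q6_K) quant from Hugging Face (names are examples only)
    huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
        Meta-Llama-3.1-8B-Instruct-Q6_K.gguf --local-dir ./models

    # Chat with it straight from the CLI; -ngl offloads layers to the GPU,
    # -c sets the context size
    llama-cli -m ./models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf -ngl 99 -c 8192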

I find that the size of the model's GGUF file roughly tells me whether it'll fit in my VRAM. For example, a 24 GB GGUF will NOT fit in 16 GB, whereas a 12 GB one likely will. However, the more context you add, the more memory you'll need on top of that.
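A quick way to eyeball it (assuming an NVIDIA card here; adjust for your setup):

    # Weights-only footprint is roughly the file size...
    ls -lh ./models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf

    # ...compare against what the GPU actually has
    nvidia-smi --query-gpu=memory.total,memory.used --format=csv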

Keep in mind that models are trained with a certain context window. If a model was trained with an 8K-token context (like most older models) and you load it with a 32K context, it won't be much help past 8K.

You can run llama.cpp on Linux, Windows, or macOS; you can grab the binaries or compile it locally. It can split the model between VRAM and RAM (if the model doesn't fit in your 16 GB). It even ships with a simple web front-end (llama-server). The same binary exposes a REST API that's similar to (but simpler than) OpenAI's and all the other "big" guys'.
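For example, something like this starts the server and web UI; the model path and layer count are placeholders, and -ngl controls how many layers go to VRAM, with the rest staying in system RAM:

    llama-server -m ./models/Meta-Llama-3.1-8B-Instruct-Q6_K.gguf \
        -ngl 35 -c 8192 --port 8080
    # web UI at http://localhost:8080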

Since it implements the OpenAI REST API, it also works with a lot of front-end tools if you want more functionality (e.g. oobabooga, a.k.a. text-generation-webui).
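For instance, the same server answers an OpenAI-style chat request; the "model" field is essentially ignored here since only one model is loaded:

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "local", "messages": [{"role": "user", "content": "Hello!"}]}'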

Koboldcpp is another backend you can try if you find llama.cpp too raw (I believe it's still llama.cpp under the hood).

gavmor 3 days ago

Why skip Ollama? I can pull any GGUF straight from Hugging Face, e.g.:

`ollama run hf.co/unsloth/DeepSeek-R1-0528-GGUF:Q8_0`

lolinder 3 days ago

> Since you're on HN, I'd recommend skipping Ollama and LM Studio.

I disagree. With Ollama I can set up my desktop as an LLM server, interact with it over WiFi from any other device, and let Ollama switch seamlessly between models as I want to swap. Unless something has changed recently, with llama.cpp's CLI you still have to shut it down and restart it with a different command line flag in order to switch models even when run in server mode.

That kind of overhead gets in the way of experimentation and can also limit applications: there are some little apps I've built that rely on being able to quickly swap between a 1B and an 8B or 30B model by just changing the model parameter in the web request.
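For what it's worth, with Ollama's default API on port 11434 that swap is literally just a different "model" value in the request body (the model tags below are only examples):

    curl http://localhost:11434/api/generate \
      -d '{"model": "llama3.2:1b", "prompt": "Summarize: ...", "stream": false}'

    curl http://localhost:11434/api/generate \
      -d '{"model": "llama3.1:8b", "prompt": "Summarize: ...", "stream": false}'

Ollama loads and unloads the weights behind the scenes as the requests come in.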

drillsteps5 3 days ago

llama.cpp can set up a REST server with the OpenAI API, so you can get many front-end LLM apps to talk to it the same way they talk to ChatGPT, Claude, etc. And you can connect to that machine from another one on the same network through whatever port you set it to. See llama-server.
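If you want it reachable from other machines, bind it to all interfaces (model path and port below are placeholders):

    llama-server -m ./models/model.Q6_K.gguf -ngl 99 --host 0.0.0.0 --port 8080
    # then point any OpenAI-compatible front-end at http://<server-ip>:8080/v1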

When you get Ollama to "switch seamlessly" between models, it still simply reloads a different model with llama.cpp, which is what it's based on.

I prefer llamacpp because doing things "seamlessly" obscures the way things work behind the scenes, which is what I want to learn and play with.

Also, and I'm not sure if this is still the case, but it used to be that when llama.cpp got adjusted to work with the latest model, it sometimes took a bit for the Python API to catch up, which is what Ollama is using. It was the case with one of the Llamas, I forget which one, where people said "oh yeah, don't try this model with Ollama yet; they're waiting on the llama.cpp folks to update llama-cpp-python with the latest changes, and once they do, Ollama will bring them into their app and we'll be up and running. Be patient."

lolinder 3 days ago

> I prefer llamacpp because doing things "seamlessly" obscures the way things work behind the scenes, which is what I want to learn and play with.

And that's a fine choice, but some of us actually just want to hack with the models, not hack on them. Ollama is great for that, and SSHing into my server to restart the process every time I want to change models just doesn't work for me.

I'd rather wait a few weeks for the newest model and be able to alternate easily than stay on the bleeding edge and sacrifice that.

ChromaticPanic 3 days ago

Ollama has a really good perk in that it makes loading and unloading models from the GPU trivial. So if you're using a front-end like LibreChat or Open WebUI, switching models is as easy as picking from the dropdown, without having to fiddle with the command line.