m4r71n 4 days ago

What is everyone using their local LLMs for primarily? Unless you have a beefy machine, you'll never approach the quality of proprietary models like Gemini or Claude, but I'm guessing these smaller models still have their use cases; I'm just not sure what those are.

Rotundo 4 days ago

Not everyone is comfortable with sending their data and/or questions and prompts to an external party.

DennisP 4 days ago

Especially now that a court has ordered OpenAI to keep records of it all.

https://www.adweek.com/media/a-federal-judge-ordered-openai-...

barnabee 4 days ago

I generally try a local model first for most prompts. It's good enough surprisingly often (over 50% for sure). Every time I avoid using a cloud service is a win.

ativzzz 4 days ago

I think that the future of local LLMs is delegation. You give it a prompt and it very quickly identifies what should be used to solve it.

Can it be solved locally with locally running MCPs? Or maybe it's a system API - like reading your calendar or checking your email. Otherwise it identifies the best cloud model and sends the prompt there.

Basically Siri if it was good
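
A minimal sketch of what I mean, assuming a local OpenAI-compatible server (Ollama here) and placeholder model names -- the interesting part is entirely how good the small model is at saying CLOUD when it should:

    from openai import OpenAI

    # Small local model behind an OpenAI-compatible endpoint (Ollama shown here).
    local = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
    # Big cloud model; key read from OPENAI_API_KEY in the environment.
    cloud = OpenAI()

    ROUTER_PROMPT = (
        "Decide who should answer the user's request. Reply with one word:\n"
        "LOCAL - calendar/email lookups, short rewrites, simple questions\n"
        "CLOUD - multi-step reasoning, long code, anything you're unsure about\n"
    )

    def answer(prompt: str) -> str:
        # Let the small model triage the request first.
        route = local.chat.completions.create(
            model="qwen2.5:3b",  # placeholder small router model
            messages=[{"role": "system", "content": ROUTER_PROMPT},
                      {"role": "user", "content": prompt}],
            max_tokens=5,
        ).choices[0].message.content.strip().upper()

        client, model = (local, "qwen2.5:3b") if route.startswith("LOCAL") \
                        else (cloud, "gpt-4o")
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return reply.choices[0].message.content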

eddythompson80 3 days ago

I completely disagree. I don't see the current status quo fundamentally changing.

That idea makes so much sense on paper, but it's not until you start implementing it that you realize why no one does it (including Siri). "Some tasks are complex and better suited for a giant model, but small models are perfectly capable of handling simple, limited tasks" makes a ton of sense, but the component best equipped to make that call is the smarter component of your system. At which point, you might as well have had it run the task.

It's like assigning the intern to triage your work items.

When actually implementing the application with that approach, every time you encounter an "AI miss" you would (understandably) blame the small model, and eventually give up and delegate yet another scenario to the cloud model.

Eventually you feel you're artificially handcuffing yourself compared to everybody else by trying to ship something on a 1B model. You have the worst of all options: a crappy model with lots of hiccups that is still (by far) the most resource-intensive part of your application, making the whole thing super heavy, while you delegate more and more to the cloud model.

The local LLM scenario is going to be driven entirely by privacy concerns (for which there is no alternative; it's not as if an E2EE LLM API could exist) or by cost concerns, if you believe you can run it cheaper.

ninjha01 3 days ago

Doesn't this ignore that some data may be privileged or too big to send to the cloud? Perhaps I have my health records in Apple Health and Kaiser Permanente. You can imagine it being okay for that data to be accessed locally, but not sent up to the cloud.

eddythompson80 3 days ago

I'm confused. Your Apple Health or Kaiser Permanente data is already stored in the cloud. It's not like it's only ever stored locally and, if you lost your phone, you lost your Apple Health or Kaiser Permanente data.

I already mentioned privacy being the only real concern, but it won't really be end-user privacy. At least that particular concern isn't the needle mover people's comments here would make you think it is. Plenty of people are already storing their medical information in Google Drive and Gmail attachments. If end-user privacy from "the cloud" were actually a thing, you would see that reflected in the market.

The privacy concerns that are of importance are that of organizations.

diggan 4 days ago

I'm currently experimenting with Devstral for my own local coding agent that I've slowly pieced together. It's in many ways nicer than Codex: 1) it has full access to my hardware, so it can start VMs, make network requests, and do everything else I can do, which Codex cannot, and 2) it's way faster, both in initial setup and in working through things and creating a patch.

Of course, it still isn't at the same level as Codex itself; the model Codex uses is just way better, so of course it'll get better results. But Devstral (as I currently use it) is able to make smaller changes and refactors, and I think it can start making larger changes too if I evolve the software a bit more.
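
Not my actual agent, just roughly the shape of the loop, assuming Devstral served behind an OpenAI-compatible endpoint (the model tag and port are whatever your local server uses; Ollama shown here):

    import json
    import subprocess
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")
    MODEL = "devstral"  # whatever tag your local server exposes

    TOOLS = [{
        "type": "function",
        "function": {
            "name": "run_shell",
            "description": "Run a shell command locally and return its output.",
            "parameters": {
                "type": "object",
                "properties": {"command": {"type": "string"}},
                "required": ["command"],
            },
        },
    }]

    messages = [{"role": "user",
                 "content": "Run the test suite and summarize the first failure."}]
    while True:
        msg = client.chat.completions.create(
            model=MODEL, messages=messages, tools=TOOLS
        ).choices[0].message
        messages.append(msg)
        if not msg.tool_calls:          # model is done asking for tools
            print(msg.content)
            break
        for call in msg.tool_calls:     # execute each requested command locally
            cmd = json.loads(call.function.arguments)["command"]
            result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": (result.stdout + result.stderr)[-4000:]})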

brandall10 4 days ago

Why are you comparing it to Codex and not Claude Code, which can do all those things?

And why not just use OpenHands, which Devstral was designed around, and which I presume can also do all those things?

moffkalast 4 days ago

> unless you have a beefy machine

The average person in r/locallama has a machine that would make r/pcmasterrace users blush.

rollcat 4 days ago

An Apple M1 is decent enough for LLMs. My friend wondered why I got so excited about it when it came out five years ago. It wasn't that it was particularly powerful - it's decent. What it did was set a new bar for "low end".

vel0city 4 days ago

A new Mac easily starts around $1k and quickly goes up from there if you want a storage or RAM upgrade, especially enough memory to really run some local models. It's insane that a $1,000 computer is called "decent" and "low end". My daily driver personal laptop was $300 brand new.

evilduck 4 days ago

An M1 Mac is about 5 years old at this point and can be had for far less than a grand.

A brand new Mac Mini M4 is only $499.

vel0city 4 days ago

Ah, I was focusing on the laptops, my bad. But still, it's more than $499. I just looked on the Apple Store website: the Mac Mini M4 starts at $599 (not $499), with only 256GB of storage.

https://www.apple.com/shop/buy-mac/mac-mini/m4

nickthegreek 4 days ago

microcenter routinely sells that system for $450.

https://www.microcenter.com/product/688173/Mac_mini_MU9D3LL-...

moffkalast 4 days ago

That's fun to hear given that low end laptops are now $800, mid range is like $1.5k and upper end is $3k+ even for non-Apple vendors. Inflation makes fools of us all.

vel0city 4 days ago

Low end laptops can still easily be found for far less than $800.

https://www.microcenter.com/product/676305/acer-aspire-3-a31...

Mr-Frog 4 days ago

The first IBM PC in 1981 cost $1,565, which is comparable to $5,500 after inflation.

rollcat 4 days ago

Of course it depends on what you consider "low end" - it's relative to your expectations. I have a G4 TiBook, the definition of a high-end laptop, by 2002 standards. If you consider a $300 laptop a good daily driver, I'll one-up you with this: <https://www.chrisfenton.com/diy-laptop-v2/>

vel0city 4 days ago

My $300 laptop is a few years old. It has a Ryzen 3 3200U CPU, a 14" 1080p display, and a backlit keyboard. It came with 8GB of RAM and a 128GB SSD; I upgraded to 16GB with RAM acquired from a dumpster dive, and to a 256GB SSD for like $10 on clearance at Microcenter. I upgraded the WiFi to an Intel AX210 6E for about another $10 off Amazon. It gets 6-8 hours of battery life doing browsing and text-editing kinds of workloads.

The only thing itching me to get a new machine is that it needs a 19V power supply. Luckily it's a pretty common barrel size, and I already had several power cables lying around that work just fine. I'd prefer to have all my portable devices run off USB-C though.

pram 4 days ago

I know I speak for everyone that your dumpster laptop is very impressive, give yourself a big pat on the back. You deserve it.

fennecfoxy 4 days ago

You're right - memory capacity and then bandwidth are what matter most for LLMs. Apple's unified memory currently lacks great bandwidth, but it's not a bad option if you can find one at a good price. The prices for new ones are just bonkers.

drillsteps5 4 days ago

I avoid using cloud whenever I can on principle. For instance, OpenAI recently indicated that they are working on some social network-like service for ChatGPT users to share their chats.

Running it locally helps me understand how these things work under the hood, which raises my value on the job market. I also play with various ideas which have LLM on the backend (think LLM-powered Web search, agents, things of that nature), I don't have to pay cloud providers, and I already had a gaming rig when LLaMa was released.

ijk 4 days ago

General local inference strengths:

- Experiments with inference-level control; you can't do the Outlines / Instructor stuff with most API services, can't do the advanced sampling strategies, etc. (They're catching up, but they're 12 months behind what you can do locally. See the sketch after this list.)

- Small, fast, finetuned models; _if you know your domain well enough to train a model, you can outperform everything else_. General models usually win, if only due to ease of prompt engineering, but not always.

- Control over which model is being run. Some drift is inevitable as your real-world data changes, but when your model is also changing underneath you it can be harder to build something sustainable.

- More control over costs; this is the classic on-prem versus cloud decision. In most cases you just want to pay for the cloud... but we're not in ZIRP anymore, and a predictable power bill can trump sudden, unpredictable API bills.
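
To make the first bullet concrete, here's roughly what inference-level control looks like with Outlines against a small local model (this is the 0.x-era API, newer releases differ, and the model name is just an example):

    from pydantic import BaseModel
    import outlines

    class Ticket(BaseModel):
        severity: int
        component: str

    # Any small local model works; this one is only an example.
    model = outlines.models.transformers("Qwen/Qwen2.5-0.5B-Instruct")

    # Constrain output to a fixed label set -- the sampler literally cannot
    # produce anything else.
    classify = outlines.generate.choice(model, ["bug", "feature", "question"])
    label = classify("Classify: 'App crashes when I rotate the screen.'")

    # Or force output that parses into a schema, guaranteed at decode time.
    extract = outlines.generate.json(model, Ticket)
    ticket = extract("Extract a ticket from: 'Login page 500s, probably auth.'")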

In general, the move to cloud services was originally a cynical OpenAI move to keep GPT-3 locked away. They've since built up a bunch of reasons to prefer the in-cloud models (heavily subsidized fast inference, the biggest and most cutting-edge models, etc.), so if you need the latest and greatest right now and are willing to pay, it's probably the right call for most businesses.

This is likely to change as we get models that can reasonably run on edge devices; right now it's hard to build an app or a video game that incidentally uses LLM tech because user revenue is unlikely to exceed inference costs without a lot of careful planning or a subscription. Not impossible, but definitely adds business challenges. Small models running on end-user devices opens up an entirely new level of applications in terms of cost-effectiveness.

If you need the right answer, sometimes only the biggest cloud API model is acceptable. If you've got some wiggle room on accuracy and can live with sometimes getting a substandard response, then you've got a lot more options. The trick is that the things an LLM is best at are always going to be things where less than five nines of reliability is acceptable, so even though the biggest models are more reliable, on average there are many tasks where you might be just fine with a small, fast model that you have more control over.

teleforce 4 days ago

This is an excellent example of a local LLM application [1].

It's an AI-driven chat system designed to support students in the Introduction to Computing course (ECE 120) at UIUC, offering assistance with course content, homework, or troubleshooting common problems.

It serves as an educational aid integrated into the course's learning environment using the UIUC Illinois Chat system [2].

Personally, I've found it really useful that it surfaces the specific portions of the course study materials (for example, the slides) directly related to the discussion, so students can check the sources and verify the answers provided by the LLM.

It seems to me that RAG is the killer feature for local LLMs [3]. It directly addresses the main pain point of LLM hallucinations and helps LLMs stick to the facts.

[1] Introduction to Computing course (ECE 120) Chatbot:

https://www.uiuc.chat/ece120/chat

[2] UIUC Illinois Chat:

https://uiuc.chat/

[3] Retrieval-augmented generation [RAG]:

https://en.wikipedia.org/wiki/Retrieval-augmented_generation
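
To make the RAG point concrete, a minimal local pipeline is just embed, retrieve, and stuff the hits into the prompt. A sketch using sentence-transformers and Ollama (model names and the toy snippets are only placeholders, not what the UIUC system actually runs):

    import ollama
    from sentence_transformers import SentenceTransformer, util

    docs = [
        "Lecture 3: two's complement represents negative integers by ...",
        "Lab 5: build a 4-bit ripple-carry adder out of full adders ...",
        "Exam 1 covers combinational logic, K-maps, and truth tables ...",
    ]

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = embedder.encode(docs, convert_to_tensor=True)

    def ask(question: str) -> str:
        # Retrieve the most relevant course snippets for this question.
        q_vec = embedder.encode(question, convert_to_tensor=True)
        hits = util.semantic_search(q_vec, doc_vecs, top_k=2)[0]
        context = "\n".join(docs[h["corpus_id"]] for h in hits)

        # Answer only from the retrieved context, and show the sources.
        reply = ollama.chat(model="llama3.1", messages=[
            {"role": "system",
             "content": "Answer using only this course material:\n" + context},
            {"role": "user", "content": question},
        ])
        return reply["message"]["content"] + "\n\nSources:\n" + context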

staticcaucasian 4 days ago

Does this actually need to be local? The chatbot is open to the public, and I assume the course material used for RAG on this page (https://canvas.illinois.edu/courses/54315/pages/exam-schedul...) all stays freely accessible - I clicked a few links without being a student - so a pre-prompted, larger non-local LLM would presumably outperform the local instance. Though you can imagine an equivalent course with all of its content ACL-gated/'paywalled' benefiting from local RAG, I guess.

ozim 4 days ago

You still can get decent stuff out of local ones.

Mostly I use them for testing tools and integrations via the API, so I don't spend money on subscriptions. When I see something working, I switch to a proprietary one to get the best results.

nomel 4 days ago

If you're comfortable with the API, all the services provide pay-as-you-go API access that can be much cheaper. I've tried local, but the time cost of getting it to spit out something reasonable wasn't worth the literal pennies the answers from the flagship would cost.

qingcharles 4 days ago

This. The APIs are so cheap and they are up and running right now with 10x better quality output. Unless whatever you are doing is Totally Top Secret or completely nefarious, then send your prompts to an API.

ozim 3 days ago

I don't see too much time spent waiting for a response. I have above-average hardware, but nothing ultra fancy, and I get decent response times from something like Llama 3.x. Maybe I'm just happy with not-instant replies, but I don't get replies much faster from online models.

nomel 1 day ago

> but I don't get replies much faster from online models.

My point is that the raw tokens/second isn't all that matters. What actually matters is the tokens/second required for a correct/acceptable-quality result. In my experience, the large LLM will almost always one-shot an answer that takes many back-and-forth iterations/revisions from Llama 3.x. With higher reasoning tasks, you might spend many iterations only to realize the small model isn't capable of providing an answer, while the large model could after a few. All that wasted time would have cost only pennies if you had just started with the large model.

Of course, it matters what you're actually doing.

qingcharles 4 days ago

If you look at r/LocalLLaMA you'll see most of the people there are really just trying to do NSFW or other questionable or unethical things with it.

The stuff you can run on reasonable home hardware (e.g. a single GPU) isn't going to blow your mind. You can get pretty close to GPT3.5, but it'll feel dated and clunky compared to what you're used to.

Unless you have already spent big $$ on a GPU for gaming, I really don't think buying GPUs for home makes sense, considering the hardware and running costs, when you can go to a site like vast.ai and borrow one for an insanely cheap amount to try it out. You'll probably get bored and be glad you didn't spend your kids' college fund on a rack of H100s.

kbelder 3 days ago

There are some other reasons to run local LLMs. If it's on my PC, I can preload the context with, say, information about all the members of my family: their birthdays, hobbies, favorite things. I can load in my schedule and the businesses I frequent. I can connect it to local databases on my machine. All sorts of things that can make it a useful assistant, but that I would never upload to a cloud service.
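
A minimal sketch of that pattern, assuming Ollama and some hypothetical note files -- the whole point being that none of it ever leaves the machine:

    from pathlib import Path
    import ollama

    # Things I'd never paste into a cloud chat box: family details, schedule,
    # local notes. Here they're just read from files on disk (paths are examples).
    profile  = Path("~/notes/family.md").expanduser().read_text()
    schedule = Path("~/notes/schedule.md").expanduser().read_text()

    system = (
        "You are a household assistant. Private context follows.\n\n"
        f"## Family\n{profile}\n\n## Schedule\n{schedule}"
    )

    reply = ollama.chat(model="llama3.1", messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "Whose birthday is coming up in the next two weeks?"},
    ])
    print(reply["message"]["content"])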

mixmastamyk 4 days ago

Shouldn't the mixture of experts (MoE) approach allow one to conserve memory by working on one specific problem type at a time?

> (MoE) divides an AI model into separate sub-networks (or "experts"), each specializing in a subset of the input data, to jointly perform a task.

ijk 4 days ago

Sort of, but the "experts" aren't easily divisible in a conceptually interpretable way so the naive understanding of MoE is misleading.

What you typically end up with in memory constrained environments is that the core shared layers are in fast memory (VRAM, ideally) and the rest are in slower memory (system RAM or even a fast SSD).

MoE models are typically very shallow-but-wide in comparison with dense models of the same total size, so they end up being faster than an equivalent dense model: each token runs through fewer layers and far fewer active parameters.
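
Back-of-the-envelope numbers for why the memory split works, using roughly Mixtral-8x7B-shaped figures (approximate):

    # Mixtral 8x7B routes each token through 2 of 8 experts per layer, so only
    # a fraction of the weights are touched per token (numbers are approximate).
    total_params  = 46.7e9   # everything that has to live *somewhere* (RAM/SSD is fine)
    active_params = 12.9e9   # shared layers + the 2 selected experts per token

    bytes_per_param = 0.5    # ~4-bit quantization
    print(f"read per token: ~{active_params * bytes_per_param / 1e9:.1f} GB "
          f"of ~{total_params * bytes_per_param / 1e9:.1f} GB of weights")
    # -> roughly 6.5 GB touched out of ~23 GB, which is why the hot shared
    #    layers can sit in VRAM while the rest lives in slower memory.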

cratermoon 4 days ago

I have a large repository of notes, article drafts, and commonplace book-type stuff. I experimented a year or so ago with a system using RAG to "ask myself" what I have to say about various topics. (I suppose nowadays I would use MCP instead of RAG?) I was not especially impressed by the results with the models I was able to run: long-winded responses full of slop and repetition, irrelevant information pulled in from notes that had some semantically similar ideas, and such. I'm certainly not going to feed the contents of my private notebooks to any of the AI companies.

notfromhere 4 days ago

You'd still use RAG, just use MCP to more easily connect an LLM to your RAG pipeline

cratermoon 4 days ago

To clarify: what I was doing was first querying for the documents via a standard document database query and then feeding the best matching documents to the LLM. My understanding is that with MCP I'd delegate the document query from the LLM to the tool.
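
For what it's worth, the MCP version would mostly mean wrapping that same document query in a tool the model can call. A rough sketch with the MCP Python SDK's FastMCP, where query_note_index is a hypothetical stand-in for whatever search I already run:

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("notes")

    @mcp.tool()
    def search_notes(query: str, limit: int = 5) -> list[str]:
        """Return the best-matching note excerpts for a query."""
        # The LLM now decides when to call this, instead of me pre-selecting
        # documents and stuffing them into the prompt up front.
        return query_note_index(query, limit)

    def query_note_index(query: str, limit: int) -> list[str]:
        # Hypothetical placeholder for the real search (vector DB, grep, whatever).
        return [f"(no real index wired up; would search notes for: {query})"][:limit]

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default; a local chat client connects to it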

longtimelistnr 4 days ago

As a beginner, I haven't had much luck with embedded vector queries either. First, setting it up was a major pain in the ass, and I couldn't even get it to ingest anything beyond .txt files. Second, maybe it was my AI system prompt or the lack of outside search capability, but unless I was very specific with my query, the response was essentially "can't find what you're looking for".

ChromaticPanic 3 days ago

What were you trying it in? With Open WebUI, RAG pretty much worked out of the box.