Run an AI on your Desktop PC?
A ChatGPT Plus subscription costs around £14 a month. Same with Claude Pro. Running your own LLM locally costs nothing monthly after the initial hardware splurge. Well, OK, a bit of electricity, but it's negligible. As with many things, though, a little caution is needed if you're tempted to run your own AI: choosing the wrong LLM will waste all the money you save. I spent weeks testing which quantized LLMs deliver great results in real-world tests on humble hardware. The winner handles 95% of tasks just as well as the paid alternatives, while the losers produce unusable rubbish and waste your time and resources. So, here's the complete guide to running your own LLM on a shoestring budget.
The Hardware Reality Check
A common view is that you need a £1,500 RTX 4090 to run decent LLMs locally, but that's misinformation spread by people who haven't actually tested budget setups. I've been running Mistral 7B and Llama models on humble hardware, and the reality is that most people already have computers capable of running these models right now.
Graphics card manufacturers have, perhaps, created artificial barriers that don't match actual performance data. Of course they want you to buy super-expensive flagship cards, when in fact mid-range hardware delivers nearly identical results with the popular LLM options for local hosting. I found that an off-the-shelf RTX 3060 with 12GB of VRAM performed within 15% of an RTX 4090 for text generation tasks, while costing around 80% less!
The actual minimum requirements reveal how misleading the marketing has been. 8GB of VRAM performs surprisingly well with quantized models, handling complex conversations and code generation without complaint. Move up to 12GB of VRAM and you can run most popular models at full throttle, with a bit of room to spare!
I found that an RTX 3060 12GB runs Mistral 7B at 38 tokens per second. The RTX 4060 Ti pushes that to 42 tokens per second for around £100 more. Even older cards like the RTX 2070 Super with 8GB handle smaller models without issues.
The major factor to keep in mind if you're going to self-host: VRAM matters much more than raw compute power for LLM performance. 'Why?' I hear you say. Because LLM inference is dominated by heavy matrix computations over the model's weights, which GPUs excel at thanks to their parallel processing, provided the whole model fits in memory. A modern GPU with plenty of VRAM is therefore a 'win win': it holds the whole model and feeds it quickly. In the exciting new world of local LLM self-hosting, VRAM comes first, system RAM second and the CPU third!
Talking of CPUs, now is probably a good time to mention 'CPU fallback'. This is when part of the LLM's computation is offloaded to the CPU because the GPU's VRAM isn't enough to hold the entire model. It allows larger and more complex LLMs to run on systems with limited GPU/VRAM resources, albeit at a slower rate. I ran Llama 7B on a basic Ryzen 5 processor and got 8 tokens per second. Yes, I know that's slow, but it's perfectly usable for research and coding tasks. In these scenarios, with CPU fallback active, your ordinary system RAM becomes the limiting factor; I found that 32GB handles most models comfortably. Which reminds me: someday, in a distant utopia, I would love to spec my next desktop PC with 64GB of RAM!
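To make that concrete, here's a minimal sketch of partial GPU offloading using the llama-cpp-python bindings, one of the simpler ways to run quantized GGUF models from Python. The model path and the layer count are placeholders I've chosen for illustration: the more layers you can fit in VRAM, the less work falls back to the CPU.

```python
# pip install llama-cpp-python   (build it with GPU support enabled for your card)
from llama_cpp import Llama

# Placeholder path to a 4-bit quantized GGUF file of Mistral 7B.
MODEL_PATH = "models/mistral-7b-instruct.Q4_K_M.gguf"

llm = Llama(
    model_path=MODEL_PATH,
    n_gpu_layers=28,   # layers offloaded to the GPU; use -1 for "as many as possible",
                       # or a lower number so the remainder falls back to the CPU
    n_ctx=4096,        # context window size
)

result = llm(
    "Summarise the benefits of quantized LLMs in two sentences.",
    max_tokens=120,
)
print(result["choices"][0]["text"])
```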
The right LLM choice can make budget hardware perform like a premium machine. Quantized versions of top-tier models run lightning fast on modest GPUs while maintaining around 95% of their original quality. (Quantization is a technique that compresses an LLM by storing its weights at lower precision, so it can run on more modest hardware. Think of it like 'zipping' your LLM, trimming the fat to make it smaller.) In my own office I've been using a sub-£300 RTX 3060 12GB setup for months, and it handles everything from marketing copy to creative writing to code debugging for me.
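A rough back-of-envelope calculation shows why quantization matters so much on a 12GB card. The figures below are approximations (real model files carry some overhead, and you still need room for the context cache), but they illustrate the scale of the saving:

```python
def approx_model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Very rough weight-storage estimate, ignoring runtime overhead."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for bits, label in [(16, "FP16 (unquantized)"), (8, "8-bit"), (4, "4-bit")]:
    print(f"Mistral 7B at {label}: ~{approx_model_size_gb(7, bits):.1f} GB")

# Approximate output:
#   FP16 (unquantized): ~14.0 GB  -- won't fit in 12GB of VRAM
#   8-bit:              ~7.0 GB
#   4-bit:              ~3.5 GB   -- comfortable on an RTX 3060 12GB
```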
What's the absolute minimum hardware you actually need to run LLMs locally without breaking the bank? An RTX 3060 with 12GB of VRAM costs under £300 and will handle 95% of local LLM tasks perfectly. That card pays for itself in about 15 months compared to subscription services.
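If you want to sanity-check that payback figure, the arithmetic is trivial. The card price is the one quoted above; the monthly figure is an assumption you should replace with whatever you actually pay (ChatGPT Plus is roughly $20 a month, so adjust for your own billing):

```python
card_cost_gbp = 300     # approximate street price of an RTX 3060 12GB
subscription_gbp = 20   # assumed monthly cost of the cloud service you'd be replacing

payback_months = card_cost_gbp / subscription_gbp
print(f"Break-even after roughly {payback_months:.0f} months")  # 15 with these figures
```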
But having capable hardware means nothing if you pick the wrong model to run on it.
The Great Model Showdown
I spent weeks running identical prompts through Mistral 7B, Llama 2 7B and Phi-2 on the same budget hardware, and discovered big performance differences. These are three of the most frequently recommended LLMs for self-hosting. I used the easy-to-use LM Studio on an office PC with a 13th-gen i5 processor and 32GB of RAM, coupled with the RTX 3060 12GB. Rather than simulating jobs, I focused on real-life tasks, and in my office it doesn't take long for more work to come along. Let's just say I had no shortage of work for these three test candidates to get stuck into!
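If you want to reproduce this kind of testing yourself, LM Studio can expose the loaded model through an OpenAI-compatible local server, which makes scripted prompting straightforward. The sketch below assumes the server is running on LM Studio's default local port (1234); the model identifier is a placeholder for whatever you have loaded:

```python
# pip install openai
from openai import OpenAI

# Point the standard OpenAI client at the local LM Studio server instead of the cloud.
# The API key can be any string, as a local server doesn't check it.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="mistral-7b-instruct",  # placeholder; use the identifier shown in LM Studio
    messages=[
        {"role": "user", "content": "Write a 150-word product description for a kettle."}
    ],
    temperature=0.7,
)
print(response.choices[0].message.content)
```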
If you enter the world of self-hosted LLMs you will encounter model 'size' and 'parameter count'. Like megapixels in a camera, you'd assume that more is better, but that's not always the case. Some smaller LLMs completely destroy larger ones in certain areas. I watched Phi-2, with only 2.7 billion parameters, generate better Python code than Llama 2 7B with its 7 billion!
My text generation tests revealed dramatic differences in response speed, quality and accuracy across the scenarios I presented to the models. Mistral 7B consistently delivered responses in 2.8 seconds for 200-word outputs, while Llama 2 7B took 4.1 seconds for identical prompts. But speed is no use if the output is low quality. Mistral maintained coherent narratives across 1,000-word responses, while Llama 2 started repeating phrases and losing focus after around 600 words!
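Speed figures like these are easy to replicate. Here's a small sketch, again assuming the local OpenAI-compatible endpoint from earlier and a server that reports token usage (if yours doesn't, you'll need to count tokens yourself): time a completion and divide the completion-token count by the elapsed seconds.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")

start = time.perf_counter()
response = client.chat.completions.create(
    model="mistral-7b-instruct",  # placeholder model identifier
    messages=[{"role": "user", "content": "Explain VRAM in about 200 words."}],
)
elapsed = time.perf_counter() - start

tokens = response.usage.completion_tokens  # assumes the server fills in the usage field
print(f"{tokens} tokens in {elapsed:.1f}s = {tokens / elapsed:.1f} tokens/sec")
```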
The coding comparison revealed the biggest difference between the models. I fed each of them a series of identical programming challenges. Phi-2 generated working code 89% of the time with hardly any debugging needed. Mistral 7B produced functional code 72% of the time but required more troubleshooting. Llama 2 7B limped in last, generating broken syntax and logic errors that took ages to locate and fix.
After this I tested the models' ability to summarize, an increasingly popular real-world request, and found big quality differences in output clarity and accuracy. I used the same 6,500-word academic study for each model. Mistral 7B captured the key points while maintaining context between sections. Llama 2 7B missed critical connections and produced summaries that felt strange and disjointed. Phi-2 struggled with a test document of this length.
After running hundreds of prompts across these diverse areas, one model consistently delivered the best results in every category. The others excelled in narrow areas but fell down elsewhere.
So, the results. How do quantized versions of Mistral 7B, Llama 2 7B and Phi-2 actually compare on budget PC hardware? For me, Mistral 7B quantized to 4-bit wins on overall performance, with the best balance of speed and quality. Llama 2 7B excels at creative writing but struggles with technical accuracy. Phi-2 dominates coding tasks despite being the smallest model tested. Congratulations, Mistral 7B!
The GPU Price to Performance Matrix
The RTX 4090 costs around £1,500 and runs Mistral 7B at 45 tokens per second, while a £300 RTX 3060 12GB runs the same model at 38 tokens per second – that's 84% of the performance for about a fifth of the price! The calculation speaks for itself. Without doubt the 3060 represents the 'sweet spot' at the minute in terms of price/performance; the performance gap for LLM work is tiny compared to the massive price difference.
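Putting those numbers into a quick tokens-per-second-per-pound comparison makes the value argument plain. The prices and speeds below are the rough figures quoted in this article, so treat the output as illustrative rather than gospel:

```python
# Rough figures from the tests above; prices are approximate street prices.
cards = {
    "RTX 3060 12GB": {"price_gbp": 300, "tokens_per_sec": 38},
    "RTX 4060 Ti":   {"price_gbp": 400, "tokens_per_sec": 42},
    "RTX 4090":      {"price_gbp": 1500, "tokens_per_sec": 45},
}

for name, card in cards.items():
    value = card["tokens_per_sec"] / card["price_gbp"]
    print(f"{name:>14}: {value:.3f} tokens/sec per £")

# The 3060 comes out far ahead: roughly 0.127 tokens/sec per £,
# versus roughly 0.030 for the RTX 4090.
```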
When weighing up graphics cards yourself, remember one thing: standard reviews, where the use case is largely gaming, will not be all that useful. The specifications that drive GPU prices have very little correlation with local LLM performance. Memory bandwidth, for example, matters far more than CUDA cores or stream processors, so when you're in the market for an LLM GPU you need to take a lot of your preconceptions and leave them at the door!
Marketing specs and blurb will completely mislead you if you are buying for LLM usage. RTX 4070 cards with 12GB of VRAM, for example, perform much the same as RTX 3060 cards with the same memory, which will likely cost you less than half as much! The extra compute power goes unused because LLM inference is largely bound by the available memory, not by fancy processing cores.
RTX 3060 12GB cards sell for under £300 and handle any Mistral 7B model perfectly. The RTX 4060 Ti costs around the £400 mark and offers slightly better efficiency and marginally higher throughput. Used RTX 3080 cards with 10GB of VRAM sit somewhere in between the two on price, but the lower VRAM means they hit memory limits with larger models, making them poor value despite the higher compute power on paper!
When buying expensive technology it's sensible to keep one eye on 'future-proofing', but in an area that is moving and changing this quickly I really wonder if there's any point. In three or four months the LLM landscape will have changed substantially again, forcing all the fanatics, testers and tweakers to re-analyze everything and come out with fresh recommendations for the best configurations. Having said that, I would not personally buy a card right now with less than 12GB of VRAM, which is enough to handle current 7B models and upcoming quantized 13B versions. 8GB cards work today (just) but limit your options as models improve. A 16GB card like the RTX 4060 Ti 16GB might cost £450, but it should be able to run any consumer-focused model for the next three years (famous last words?).
Let's drop in another acronym: TCO (total cost of ownership). It includes power consumption, so don't forget to allow for that in your calculations. RTX 3060 cards draw 170 watts maximum, while RTX 4090 cards consume 450 watts under load. Yikes! Running models for hours every day creates an electricity bill that pushes me towards efficient hardware over raw performance, especially at a time when electricity costs are at or near all-time highs!
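Here's a quick way to estimate the running-cost difference. The usage pattern and unit price are assumptions (swap in your own tariff and habits), and the wattages are the maximum draws quoted above, so real-world consumption will usually be lower:

```python
def monthly_electricity_gbp(watts: float, hours_per_day: float,
                            pence_per_kwh: float, days: int = 30) -> float:
    """Electricity cost for running a card at the given draw for a month."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * pence_per_kwh / 100

# Assumed usage: 4 hours of inference a day on a 30p/kWh tariff.
for name, watts in [("RTX 3060", 170), ("RTX 4090", 450)]:
    cost = monthly_electricity_gbp(watts, hours_per_day=4, pence_per_kwh=30)
    print(f"{name}: ~£{cost:.2f} per month")

# With these assumptions: RTX 3060 ~£6.12 per month, RTX 4090 ~£16.20 per month.
```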
So, factoring in everything, which GPUs give you the best performance per pound for local LLM hosting? The RTX 3060 12GB offers unbeatable value under £300, while the RTX 4060 Ti is the sweet spot for users wanting maximum performance under £400. Both cards draw only around 160–170 watts. Not too scary!
So does it make sense to buy your own local AI setup and ditch your subscriptions? A £300 RTX 3060 12GB running quantized Mistral 7B pays for itself in roughly 15 months compared to a ChatGPT Plus subscription, while delivering comparable results for 95% of tasks. The only weak point in this plan is the 'shiny new feature' effect. ChatGPT and the others will keep releasing better models with brilliant new capabilities at a rapid pace. If you don't already have the components lying about to upgrade your desktop PC, you could end up nursing an outdated local LLM just to justify the initial hardware outlay. In other words, going down this route purely for financial reasons could trap you into using older LLM technology for longer while you claw back your investment, and that's not a great place to be. With AI set to play a bigger and bigger role in all areas of our lives, it's best to keep on top of the latest developments, not lag behind.
But there are reasons beyond the financial to host your own, privacy being the obvious one. Yes, we all love document summaries, but what if you need a legal or medical document summarized, or one containing a client's sensitive information? Uploading it to a cloud-based mainstream LLM creates a whole collection of potential privacy problems that are largely solved by hosting locally. Whatever your use case, if you're thinking of proceeding with your own LLM, my final advice is this: start with Mistral 7B 4-bit on whatever 8GB+ VRAM card you can afford, then upgrade your model selection before upgrading your hardware. Local LLMs aren't perfect replacements for cloud services, but they have reached the point where they're finally good enough to let you save some money while maintaining your privacy and data control.

One final, massive caveat to everything that has just been said. I realize that AI is one of the fastest-moving areas of technological innovation, if not the fastest. We all know how much things have changed in the last year, and the rate of change, if anything, appears to be accelerating. It is highly likely that within a couple of months much of the detail in this article will be out of date. So be it. There's no point worrying about something we can't control, and AI is like the tide or the wind: it cannot be held back! Maybe we will revisit this topic in six months or a year's time and see if a 3060 graphics card is still up to the task…
