Over the past few months, the editorial team at Computer Weekly's French sister title, LeMagIT, has been evaluating different versions of several free, downloadable large language models (LLMs) on personal machines. These LLMs currently include Google's Gemma 3, Meta's Llama 3.3, Anthropic's Claude 3.7 Sonnet, several versions of Mistral (Mistral, Mistral Small 3.1, Mistral Nemo, Mixtral), IBM's Granite 3.2, Alibaba's Qwen 2.5, and DeepSeek R1, which is primarily a reasoning overlay on top of distilled versions of Qwen or Llama.

The test protocol consists of trying to transform interviews recorded by journalists during their reporting into articles that can be published directly on LeMagIT. What follows is an account of the LeMagIT team's experience.

We are assessing the technical feasibility of doing this on a personal machine and the quality of the output with the resources available. Let's make it clear from the outset that we have never yet managed to get an AI to work properly for us. The only point of this exercise is to understand the real possibilities of AI based on a concrete case.

Our test protocol is a prompt that includes 1,500 tokens (6,000 characters, or two magazine pages) to explain to the AI how to write an article, plus an average of 11,000 tokens for the transcription of an interview lasting around 45 minutes. Such a prompt is generally too large to fit into the free usage window of an online AI. That is why it makes sense to download an AI onto a personal machine, since the processing remains free, whatever its size.

The protocol is launched from the LM Studio community software, which mimics the online chatbot interface on the personal computer. LM Studio has a function for downloading LLMs directly. However, all the LLMs that can be downloaded free of charge are also available on the Hugging Face website.
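For readers who want to drive this kind of protocol from a script rather than from LM Studio's chat window, the sketch below sends the writing instructions and an interview transcript to a locally served model. It assumes LM Studio's built-in local server is enabled and listening on its default port (1234) with its OpenAI-compatible endpoint; the model identifier and file names are illustrative, not the exact ones used in our tests.

```python
# Minimal sketch: send the article-writing instructions plus an interview
# transcript to a model served locally by LM Studio.
# Assumes LM Studio's local server is running on its default port (1234)
# and exposing its OpenAI-compatible API; model name and paths are examples.
import requests

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def write_article(instructions_path: str, transcript_path: str, model: str) -> str:
    with open(instructions_path, encoding="utf-8") as f:
        instructions = f.read()          # ~1,500 tokens of writing guidelines
    with open(transcript_path, encoding="utf-8") as f:
        transcript = f.read()            # ~11,000 tokens of interview transcript

    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": instructions},
            {"role": "user", "content": "Interview transcript:\n\n" + transcript},
        ],
        "temperature": 0.7,
    }
    response = requests.post(LMSTUDIO_URL, json=payload, timeout=3600)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    article = write_article("prompt_article.txt", "interview.txt",
                            model="gemma-3-27b-it-q8_0")  # illustrative model name
    print(article)
```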
Technically, the quality of the result depends on the amount of memory used by the AI. At the time of writing, the best result is achieved with an LLM of 27 billion parameters encoded on 8 bits (Google's Gemma, in the "27B Q8_0" version), with a context window of 32,000 tokens and a prompt length of 15,000 tokens, on a Mac with an M1 Max SoC and 64 GB of RAM, of which 48 GB is shared between the processor cores (orchestration), the GPU cores (vector acceleration for searching for answers) and the NPU cores (matrix acceleration for understanding input data).

In this configuration, the processing speed is 6.82 tokens/second. The only way to speed up processing without damaging the result is to opt for an SoC with a higher clock frequency, or with more processing cores.

In this configuration, LLMs with more parameters (32 billion, 70 billion, etc) exceed memory capacity and either do not load at all, or generate truncated results (a single-paragraph article, for example). LLMs with fewer parameters use less memory, but the quality of writing falls dramatically, with repetitions and unclear information. Using parameters encoded on fewer bits (3, 4, 5 or 6) significantly speeds up processing, but also reduces the quality of writing, with grammatical errors and even invented words.

Finally, the size of the context window in tokens depends on the size of the data to be supplied to the AI. It is non-negotiable. If this size saturates memory, then you should opt for an LLM with fewer parameters, which frees up RAM to the detriment of the quality of the final result.

Our tests have resulted in articles that are well written. They have an angle, a coherent chronology of several thematic sections, quotations in the right place, and a dynamic headline and concluding sentence.

However, we have never managed to obtain a publishable article. Regardless of the LLM used, including DeepSeek R1 and its supposed reasoning abilities, the AI is systematically incapable of correctly prioritising the various points discussed during the interview. It always misses the point and often generates pretty but uninteresting articles. Occasionally, it will write an entire, well-argued speech to tell its readers that the company interviewed... has competitors.

LLMs are not all equal in the vocabulary and writing style they choose. At the time of writing, Meta's Llama 3.x produces sentences that are difficult to read, while Mistral and, to a lesser extent, Gemma have a tendency to write like marketing agencies, with flattering adjectives but little concrete information.

Surprisingly, the LLM that writes the most elegant French within the limits of the test configuration is China's Qwen. Initially, the most competent LLM on our test platform was Mixtral 8x7B (with an x instead of an s), which mixes eight thematic LLMs, each with just 7 billion parameters.

However, the best options for fitting Qwen and Mixtral into the 48 GB of our test configuration are, for the former, a version with only 14 billion parameters and, for the latter, parameters encoded on 3 bits. The former writes unclear and uninteresting information, even when mixed with DeepSeek R1 (DeepSeek R1 is only available as a distilled version of another LLM, either Qwen or Llama). The latter is riddled with syntax errors.

The version of Mixtral with parameters encoded on 4 bits offered an interesting compromise, but recent developments in LM Studio, with a larger memory footprint, prevent this AI from working properly. Mixtral 8x7B Q4_K_M now produces truncated results.

An interesting alternative to Mixtral is the very recent Mistral Small 3.1, with 24 billion parameters encoded on 8 bits, which, according to our tests, produces results of a quality fairly close to Gemma 3's. What's more, it is slightly faster, at 8.65 tokens per second.
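To make these memory trade-offs more concrete, the rough estimate below converts parameter count and quantisation level into a weight footprint. It is a deliberate simplification: the parameter counts are the approximate published sizes of the models cited, and the calculation ignores the KV cache of the 32,000-token context window and the runtime's own overhead, which also have to fit into the 48 GB budget.

```python
# Rough back-of-envelope estimate of an LLM's weight footprint.
# Weights only: the cache for a 32,000-token context window and the
# runtime's overhead add several more gigabytes on top of this.
def weight_footprint_gb(params_billions: float, bits_per_param: float) -> float:
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

configs = [
    ("Gemma 3 27B, 8-bit", 27, 8),    # ~27 GB: fits in 48 GB of unified memory
    ("Qwen 2.5 32B, 8-bit", 32, 8),   # ~32 GB of weights alone: too tight with context
    ("Llama 3.3 70B, 8-bit", 70, 8),  # ~70 GB: exceeds 48 GB before context is counted
    ("Mixtral 8x7B, 4-bit", 47, 4),   # all eight experts stay loaded, hence low-bit encodings
    ("Qwen 2.5 14B, 8-bit", 14, 8),   # ~14 GB: fits comfortably, at the cost of writing quality
]

for name, params, bits in configs:
    print(f"{name}: ~{weight_footprint_gb(params, bits):.0f} GB of weights")

# On a PC whose GPU has its own dedicated memory (discussed below), the
# weights must fit into the graphics card's share alone, which is why such
# machines are limited to much smaller models.
```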
According to the specialists interviewed by LeMagIT, the hardware architecture most likely to support the work of generative AI on a personal machine is one where the same RAM is accessible to all types of computing cores at the same time. In practice, this means using a machine based on a system-on-chip (SoC) processor where the CPU, GPU and NPU cores share the same physical and logical access to the RAM, with data located at the same addresses for all the circuits.

When this is not the case - that is, when the personal machine has an external GPU with its own memory, or when the processor is indeed a SoC that integrates the CPU, GPU and NPU cores but where each has access to a dedicated portion of the common RAM - then the LLMs need more memory to function. This is because the same data needs to be replicated in each portion dedicated to the circuits.

So, while it is indeed possible to run an LLM with 27 billion parameters encoded on 8 bits on an Apple Silicon Mac with 48 GB of shared RAM, using the same evaluation criteria we would have to make do with an LLM of 13 billion parameters on a PC where a total of 48 GB of RAM is divided between 24 GB for the processor and 24 GB for the graphics card.

This explains the initial success of Apple Silicon-based Macs for running LLMs locally, as this chip is a SoC where all the circuits benefit from UMA (unified memory architecture) access. In early 2025, AMD imitated this architecture in its Ryzen AI Max SoC range. At the time of writing, Intel's Core Ultra SoCs, which combine CPU, GPU and NPU, do not have such unified memory access.

Writing the prompt that explains how to write a particular type of article is an engineering job in itself. The trick to getting off to a good start is to give the AI a piece of work that has already been done by a human - in our case, a finished article accompanied by the transcript of the corresponding interview - and to ask it what prompt it should have been given to do the same job. Around five very different examples are enough to determine the essential points of the prompt to be written for a particular type of article.

However, the AI systematically produces prompts that are too short, which would never be enough to produce a full article. So the job becomes taking the leads it gives us and backing them up with all the business knowledge we can muster.
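As a hypothetical sketch of this "reverse prompting" step - showing the model a finished, human-written article together with the transcript it came from, and asking it what prompt it should have been given - the helper below builds such a request in the same message format as the earlier sketch. The wording of the request and the function name are illustrative, not the actual prompt used by LeMagIT.

```python
# Hypothetical sketch of the reverse-prompting step described above.
# Function name and wording are illustrative examples only.
def build_reverse_prompt(article_text: str, transcript_text: str) -> list[dict]:
    return [
        {"role": "system",
         "content": "You are helping a journalist design a writing prompt."},
        {"role": "user",
         "content": (
             "Here is a published article:\n\n" + article_text +
             "\n\nAnd here is the transcript of the interview it is based on:\n\n" +
             transcript_text +
             "\n\nWhat prompt should have been given to a language model, "
             "along with this transcript, so that it would produce this article?"
         )},
    ]
```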
Note that the more pleasantly the prompt is written, the less precisely the AI understands what is being said in certain sentences. To avoid this bias, avoid pronouns as much as possible ("he", "this", "that", etc) and repeat the subject each time ("the article", "the article", "the article"...). This makes the prompt harder for a human to read, but more effective for the AI.

Ensuring that the AI has enough latitude to produce varied content each time is a matter of trial and error. Despite our best efforts, all the articles produced by our test protocol have a family resemblance. It would take considerable effort to synthesise the full range of human creativity in the form of different competing prompts.

Within the framework of our test protocol, and given the state of AI capabilities at the time of writing, it is illusory to think that an AI is capable of determining on its own the degree of relevance of all the comments made during an interview. Trying to get it to write a relevant article therefore necessarily involves a preliminary stage of stripping down the transcript of the interview.

In practice, stripping the transcript of an interview of all the elements that are unnecessary for the final article - without, however, eliminating elements of context that have no place in the final article but that guide the AI towards better results - requires the transcript to be rewritten. This rewriting costs human time, to the benefit of the AI's work but not to the benefit of the journalist's work.

This is a very important point - from that moment onwards, AI stops saving the user time. As it stands, using AI means shifting working time from an existing task (writing the first draft of an article) to a new task (preparing data before handing it over to an AI).

Secondly, the 1,500-token description of the outline to follow when writing an article only works for one particular type of article. In other words, you need to write one outline for articles about a startup proposing an innovation, a completely different outline for those about a supplier launching a new version of its product, yet another for a player setting out a new strategic direction, and so on. The more use cases there are, the longer the upstream engineering work takes.

Worse still, to date our experiments have only involved writing articles based on a single interview, usually at a press conference - that is, in a context where the interviewee had already structured his or her comments before delivering them. In other words, after more than six months of experimentation, we are still only at the simplest stage. We have not yet been able to invest time in more complex scenarios, which are nevertheless the daily lot of LeMagIT's production, starting with articles written on the basis of several interviews.

The paradox is as follows - for AI to relieve a user of some of their work, that user has to work more. On the other hand, on these issues, AI on a personal machine is on a par with paid AI online.

Read more about using LLMs

- Google claims AI advances with Gemini LLM: Code analysis, understanding large volumes of text and translating a language by learning from one read of a book are among the breakthroughs of Gemini 1.5.
- Prompt engineering is not for dummies: This is a guest post written by Sascha Heyer in his capacity as senior machine learning engineer at DoiT International, where he oversees machine learning.
- What developers need to know about Large Language Models: A developer strolls casually into work and gets comfy in their cubicle. Suddenly there's an update alert on the laptop screen - a new generative artificial intelligence function has been released.