Jun 4

AI honeymoon pricing is over, but your work is not

8 Comments

Sandeep

Jun 7

Good read! Typo maxInputTokens -> maxOutputokens

Reply (1)

Alex Ewerlöf

Jun 7

Thank you Sandeep. copy/paste error! Fixed :)

Reply (1)

Sandeep

Jun 7

Also, Gemma 12b qat model is available in LM studio now. I am working with an M4 Mac mini with 16 GB RAM (GPU/CPU given URAM on Apple Silicon).

When using VSCode Copilot as a harness, I am having a problem with Copilot’s layout engine choking before it even attempts to talk to LM Studio.

Trick I am doing to resolve this:

1. Click the model dropdown at the bottom of the chat panel (where it says Gemma 4 12B (chat-completions)).

2. Temporarily switch it back to a default cloud model (like GPT-4o or Gemini).

3.Type "hi" to confirm the chat panel cleans itself up and successfully renders.

4. Once it is working normally, switch the dropdown back to your custom Gemma 4 12B endpoint.

Not sure if it's just me or you faced something similar

andy

Jun 13Edited

you missed one of the pros in your summary of running local models for development workflows: no usage limits or per-token costs, ever.

andy

Jun 13

when I tried Q8_0 for K cache and Q4_0 V cache, Gemma 4 26B A4B QAT failed to load. I've been running Q8_0 for both K and V cache settings for a while now and it seems solid, and much less RAM usage than the default F16

Jeremy Wiersma

Jun 21

Hi Alex. I am trying to use the Gemma 12B QAT model via LM Studio with Github CoPilot but I cannot get it to maintain a conversation. I keep hitting an error when I try to kick off a coding session with a "Response too long" error. I have configured my model to have a 64k / 16k input/output token ratio similar to you, but the 12B QAT has a max output token limit of 2048 so I presume it's limited anyway. What I don't understand is what I have got setup wrong to not even maintain a conversation. The kind of error I see in Copilot is:

---

Sorry, your request failed. Please try again.

Client Request Id: d7c5a66c-d709-4e1a-8e53-5356dc89e601

Reason: Response too long.: Error: Response too long. at FG._provideLanguageModelResponse (c:\Program Files\Microsoft VS Code\fcf604774b\resources\app\extensions\copilot\dist\extension.js:1710:14094) at process.processTicksAndRejections (node:internal/process/task_queues:104:5) at async FG.provideLanguageModelResponse (c:\Program Files\Microsoft VS Code\fcf604774b\resources\app\extensions\copilot\dist\extension.js:1710:15097)

---

So far help from AI ironically has been - unhelpful. Any ideas you can suggest?

Reply (1)

Alex Ewerlöf

Jun 21

Hi Jeremy,

Copilot adds a massive system prompt and tools list (20-30k tokens) which wastes the limited local LLM context window.

Can we isolate the issue?

Do you have the possibly to try the model in Pi? Or can you programmatically send a REST request to the end point?

You said "12B QAT has a max output token limit of 2048". Where do you see that? To my knowledge it doesn't have a cap on output tokens. You mean your own config in VS Code?

Per

Jun 21

I was just looking for a write-up on how to set this up, so this was a nice read. I've been happily using Copilot up until the price hike, so I was quite annoyed when MS announced that. Luckily, I got a 7900 XTX a year ago, or so, which I've now put to good use! Thank you.

Using local LLMs for agentic coding