Phase 1: The "Heavyweight" Disaster with "Qwen 3.5 Coder Next"
I started ambitious. I pulled Qwen 3.5 Coder Next. I thought 32GB of RAM would be enough. I was wrong.
The Experience:
Initial 'Hi': 25 seconds.
Coding Task: 7+ minutes just to generate some initial instructions and start on the variables section of an Arduino sketch.
The Culprit: My logs showed the model needed 51.3 GiB of memory. Since I only have 32GB, Ollama had to shove 26GB of the "brain" onto my CPU.
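To see why the spill happens, here is a back-of-the-envelope check using the numbers from my logs. The proportional split below is my own simplification for illustration, not Ollama's actual scheduler logic:

```python
# Rough sketch of the GPU/CPU memory split (GiB figures from my server logs;
# the split rule is a simplification, not Ollama's exact behavior).

model_needs_gib = 51.3   # total memory the model wanted (from the log)
system_ram_gib = 32.0    # my M1 Pro's unified memory
usable_gpu_gib = 26.0    # macOS caps Metal around recommendedMaxWorkingSetSize

# Whatever doesn't fit in the Metal working set falls back to CPU inference.
spilled_to_cpu_gib = max(0.0, model_needs_gib - usable_gpu_gib)
cpu_fraction = spilled_to_cpu_gib / model_needs_gib

print(f"Spilled to CPU: {spilled_to_cpu_gib:.1f} GiB "
      f"({cpu_fraction:.0%} of the model)")
```

Roughly half the model running on CPU explains the dial-up-era response times.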
Phase 2: The "Thinking" Trap with "Qwen 3.5 9B"
I pivoted to a smaller model: Qwen 3.5 9B. On paper, this should have been lightning fast. However, I ran into a new hurdle: Reasoning Loops.
Even with the smaller 9B model, the "Reasoning/Thinking" phase was taking forever—sometimes up to 7 minutes of "thinking" without a single line of code being written. At one point, it even got caught in a logical loop, and I had to restart the process.
Phase 3: The Breakthrough (The "Nothink" Secret)
The real "Aha!" moment came when I realized I didn't need the model to spend several minutes pondering the meaning of life for a simple Arduino script. I just needed the code.
I used a simple command to bypass the heavy reasoning phase: >>> /set nothink
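The `/set nothink` command works in the interactive REPL; if you script Ollama instead, recent versions expose the same switch as a `think` flag on the REST API. A minimal sketch (the model tag is a placeholder; use whatever `ollama list` shows, and check your Ollama version's docs, since this parameter is relatively new):

```python
import json
import urllib.request

# Equivalent of "/set nothink", but for scripted calls to the local server.
payload = {
    "model": "qwen3:8b",  # placeholder tag; substitute your installed model
    "prompt": "Write an Arduino sketch that blinks an LED on pin 13.",
    "think": False,       # skip the long reasoning phase, go straight to code
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment with a running server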
The difference was huge:
Total Response Time: 2 minutes and 28 seconds for a complete, complex answer.
Content Quality: It wasn't just code. It gave me prerequisites, circuit wiring, full Arduino code, security tips, and even improvement ideas.
Memory Efficiency: The logs show this model is a perfect fit for the M1 Pro. It only used about 9.1 GiB of total memory, meaning 100% of the model layers (33/33) stayed on the GPU (Metal).
Technical Insights from the Logs
If you are troubleshooting your own local setup, here is what I learned from the Ollama server logs:
Check your Offloading: In my successful 9B run, the logs said: offloaded 33/33 layers to GPU. This is the ideal case. If that number isn't 100%, your performance will suffer.
Flash Attention is King: The logs confirmed enabling flash attention. This helps the Mac handle long conversations without slowing down.
The "Unified" Advantage: My M1 Pro was able to allocate a recommendedMaxWorkingSetSize of ~26GB. By using the 9B model (which only needs ~9GB), I left plenty of room for my system to breathe.
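If you want to sanity-check your own logs, a tiny script can pull out the two signals above. The sample lines below match what I saw on my machine; log formats can vary between Ollama versions, so treat the patterns as a starting point:

```python
import re

# Sample lines in the style of my Ollama server logs (formats may differ
# between versions; adjust the patterns to match your own output).
sample_log = """
llm_load_tensors: offloaded 33/33 layers to GPU
llama_new_context_with_model: enabling flash attention
"""

# Extract the layer-offload ratio.
m = re.search(r"offloaded (\d+)/(\d+) layers to GPU", sample_log)
offloaded, total = (int(m.group(1)), int(m.group(2))) if m else (0, 0)
fully_on_gpu = total > 0 and offloaded == total

# Check whether flash attention was enabled.
flash_attention = "flash attention" in sample_log

print(f"GPU layers: {offloaded}/{total}, flash attention: {flash_attention}")
```

Anything less than a full N/N offload means part of the model is running on CPU, and you will feel it.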
The Verdict: Is it a "Free" Coding Assistant?
Is a "free" local coding assistant possible? Yes—but resources matter. If you try to run massive models on a 32GB Mac, you'll feel like you're back in the era of dial-up.
If this continues to feel this good, I’m considering the ultimate "pro" move: using this Mac as a headless AI server. I can connect it to a secure network, tuck it away on a shelf, and reach its "brain" from my other computers: for chatting, from within VS Code (with a suitable plugin, of course), or even from my phone. My main coding machine stays cool and quiet, while the M1 Pro does all the heavy lifting in the background.
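For the headless setup, the usual approach is to make Ollama listen on the network instead of just localhost via the OLLAMA_HOST environment variable. A sketch for macOS (the LAN IP below is a placeholder; only do this on a network you trust, as Ollama has no built-in authentication):

```shell
# On the M1 Pro server: make Ollama bind to all interfaces.
# launchctl setenv persists the variable for GUI apps until reboot.
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
# ...then quit and restart the Ollama app so it picks up the setting.

# On a client machine: point the CLI (or a VS Code plugin) at the server.
export OLLAMA_HOST="192.168.1.50:11434"   # placeholder LAN IP of the Mac
ollama list                                # should show the server's models
```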
I hope my little weekend experiment has given you some useful insight, or the inspiration to set out on your own journey toward AI independence.
