Simon Willison seeks DeepSeek-V4-Flash performance on Apple silicon Macs
Simon Willison asks about real-world performance of DeepSeek-V4-Flash on Macs with 512 GB, 256 GB, or 128 GB (or less) of RAM. Discussion centers on quantization for deployment, including a 2-bit inference build that uses IQ2_XXS for the w1/w3 tensors, Q2_K for w2, and higher precision elsewhere; this configuration reaches 8 tokens per second on a MacBook with an M3 Max while scoring 86.1 on the benchmark. The community also expresses interest in a timeline for a 1.5-bit flash quantization.
First post
Why it matters
The DeepSeek V4 Flash model contains 284 billion total parameters but activates only 13 billion on each inference pass.
Rohan Anil (former Gemini lead at Google DeepMind, now at Anthropic) inquired about the timeline for a 1.5-bit quantization.
Simon Willison (Datasette creator and Django co-creator) develops software that simplifies running large models on Apple silicon.
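To see why quantization is the crux of the Mac RAM discussion, here is a rough back-of-the-envelope memory estimate. The 284-billion-parameter total comes from the post; the average bits-per-weight figure and the 20% headroom factor are assumptions for illustration only, since the actual mix of IQ2_XXS, Q2_K, and higher-precision tensors determines the real footprint.

```python
# Rough weight-memory estimate for a 284B-parameter MoE model under a
# mixed ~2-bit quantization scheme. The bits-per-weight average (2.3)
# and the 20% RAM headroom are illustrative assumptions, not measured values.

TOTAL_PARAMS = 284e9      # total parameters (from the post)
BITS_PER_WEIGHT = 2.3     # assumed blended average across IQ2_XXS / Q2_K / higher-precision tensors

def model_size_gb(params: float, bpw: float) -> float:
    """Approximate in-RAM weight size in gigabytes."""
    return params * bpw / 8 / 1e9

size = model_size_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
for ram_gb in (128, 256, 512):
    # Leave ~20% of RAM free for the KV cache, activations, and the OS.
    verdict = "fits" if size < ram_gb * 0.8 else "does not fit"
    print(f"{ram_gb} GB Mac: ~{size:.0f} GB of weights -> {verdict}")
```

Under these assumptions the weights come to roughly 80 GB, which is why even the 128 GB configurations in Willison's question are plausibly in range at ~2 bits, while higher-precision builds would not be.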
3 more posts
rohan anil (@_arohan_), quote tweet: "1.5 bit flash when?"