AI Without Multiplication: Inside Ternary Models

Running an AI model takes a ridiculous amount of power. Much of that energy goes toward one tedious thing: multiplying giant grids of floating-point numbers over and over again. Every time a model produces output, the hardware has to multiply matrices holding billions of parameters by the incoming data. It takes massive, complicated circuits and tons of electricity just to crunch those numbers.

But what if we just... stopped multiplying?

That is exactly what ternary models do. The most popular version of this right now is Microsoft's BitNet b1.58. Instead of high-precision floating-point weights, it trains models whose weights take just three simple values: -1, 0, and 1. And just like that, the heaviest math in AI vanishes.

What is a 1.58-Bit Weight?

Okay, the name sounds weird. 1.58 bits? Computers usually think in binary, which means a choice between a 0 or a 1. That represents exactly one bit of information.

But if you allow a third option, say -1, 0, and 1, the math changes. In information theory, picking one of three equally likely choices takes log_2(3) bits, which works out to about 1.58.
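The arithmetic is easy to check. The snippet below is a small illustration, not part of BitNet itself, and the helper name bits_per_symbol is made up for this example:

```python
import math


def bits_per_symbol(num_choices: int) -> float:
    """Bits needed to pick one of `num_choices` equally likely values."""
    return math.log2(num_choices)


binary_bits = bits_per_symbol(2)   # two choices: 0 or 1
ternary_bits = bits_per_symbol(3)  # three choices: -1, 0, or 1

print(f"binary:  {binary_bits:.2f} bits")   # binary:  1.00 bits
print(f"ternary: {ternary_bits:.2f} bits")  # ternary: 1.58 bits
```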

Think of it like a stoplight with three colors instead of a basic on-off switch.

Because each weight carries so little information, the model shrinks drastically. I've seen a normal 2-billion parameter model take up around 4 gigabytes of memory at 16-bit precision. With this trick? It drops to under 700 megabytes. Seven hundred megabytes. Models that used to need a dedicated GPU can suddenly fit in the memory of a regular phone or a cheap laptop.
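Here is a rough back-of-the-envelope version of that calculation. It counts weight storage only, and the function name footprint_megabytes and the 2-bits-per-weight packing are assumptions made for illustration. Anything a real model keeps at higher precision (embeddings, scaling factors, runtime buffers) would come on top of these numbers.

```python
import math


def footprint_megabytes(num_params: int, bits_per_weight: float) -> float:
    """Storage for the weights alone, in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bits_per_weight / 8 / 1e6


NUM_PARAMS = 2_000_000_000  # the 2-billion parameter example from the text

fp16_mb = footprint_megabytes(NUM_PARAMS, 16)             # 16-bit floats
ideal_mb = footprint_megabytes(NUM_PARAMS, math.log2(3))  # theoretical ternary limit
packed_mb = footprint_megabytes(NUM_PARAMS, 2)            # simple 2-bit packing (assumed)

print(f"16-bit floats:   {fp16_mb:.0f} MB")    # 4000 MB
print(f"ideal ternary:   {ideal_mb:.0f} MB")   # 396 MB
print(f"2 bits per weight: {packed_mb:.0f} MB")  # 500 MB
```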

The Hardware Reality: Additions over Multiplications

Here is how the trick actually works inside the silicon chip.

In a standard AI setup, the computer takes the incoming data (X) and multiplies it by the model's weights (W). With BitNet, the incoming data gets rounded to low-precision whole numbers (8-bit integers in BitNet b1.58), and the weights are strictly -1, 0, or 1.

Look at what happens when you multiply any number by those three choices:

Weight is 1: The number stays exactly the same (X). No mathematical work needed.
Weight is -1: You just flip the sign to a negative (-X). A basic toggle.
Weight is 0: The number becomes zero. You just ignore it and move on.
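The three cases above can be sketched as a dot product that never multiplies. This is a plain-Python illustration of the idea, not how any real kernel is written, and the function name ternary_dot is hypothetical:

```python
def ternary_dot(activations: list[int], weights: list[int]) -> int:
    """Dot product where every weight is -1, 0, or 1, using only add/subtract."""
    total = 0
    for x, w in zip(activations, weights):
        if w == 1:
            total += x   # weight 1: add the value as-is
        elif w == -1:
            total -= x   # weight -1: subtract the value
        elif w != 0:
            raise ValueError(f"not a ternary weight: {w}")
        # weight 0: skip the value entirely
    return total


x = [5, -3, 7, 2]    # integer activations
w = [1, -1, 0, -1]   # ternary weights

result = ternary_dot(x, w)
print(result)  # 5 + 3 + 0 - 2 = 6
```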

Notice anything missing? Multiplication.

The chip doesn't need to use expensive, power-hungry floating-point multipliers anymore. The whole process turns into simple integer addition and subtraction. Integer adders are tiny. They take up a fraction of the physical space on a chip and use barely any power. This opens up a clear path to running heavy AI on low-power devices, smartwatches, and everyday CPUs without needing massive, expensive GPU clusters.

Accuracy at Scale: The 3-Billion Parameter Turning Point

Usually, if you take a smart AI and aggressively crush its numbers down to save space, it breaks. The output degrades into pure gibberish because you ruined the model's brain.

But BitNet is different. It doesn't compress an old model. It builds the model this way from day one.

During the training phase, the computer keeps a high-precision "master copy" of the weights in the background so that tiny gradient updates can accumulate. But every forward pass, in training and when the model is generating text, only ever uses the rounded -1, 0, and 1 values.
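To make the rounding step concrete, here is a minimal sketch. The text above does not say how the rounding is done. This example uses the "absmean" scheme described in the BitNet b1.58 paper: divide by the average absolute weight, then round and clip to -1, 0, or 1. The function name and sample values are invented for illustration, and the gradient handling used during training is left out.

```python
def absmean_ternarize(latent_weights: list[float], eps: float = 1e-8):
    """Round high-precision weights to -1, 0, or 1 using absmean scaling."""
    # Scale: the mean absolute value of the high-precision weights.
    scale = sum(abs(w) for w in latent_weights) / len(latent_weights)
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in latent_weights]
    return ternary, scale


latent = [0.42, -0.07, -0.91, 0.10, 0.55, -0.33]  # made-up master-copy weights
ternary, scale = absmean_ternarize(latent)

print(ternary)  # [1, 0, -1, 0, 1, -1]
```

The master copy (latent) is what training keeps updating. The ternary list is what the forward pass actually uses.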

Microsoft's benchmarks showed something wild. Starting at around 3 billion parameters, the reported performance gap disappears. The ternary models perform just as well as heavy, full-precision models of the same size. The AI basically compensates for the simple numbers by just being larger.

Practical Bottlenecks and Current Hardware

So, what is the catch? Why aren't we all running these models right now?

Well, our current graphics cards are stubborn. They were built for floating-point math or standard integer formats like 8-bit and 4-bit. They do not have native, hard-wired circuits meant for 1.58-bit weights.

Right now, developers have to use clever software runtimes like bitnet.cpp to make consumer-grade processors unpack these ternary weights on the fly. It is still reported to run 2x to 4x faster than conventional full-precision inference and keeps your device cool, but it is a workaround.
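To give a feel for what "unpacking on the fly" means, here is a toy sketch that squeezes four ternary weights into one byte and pulls them back out. The 2-bit layout and the names CODE_OF, pack_ternary, and unpack_ternary are assumptions made up for this example. This is not the storage format bitnet.cpp actually uses.

```python
# Hypothetical 2-bit codes for each ternary weight (the code 0b11 is unused).
CODE_OF = {-1: 0b00, 0: 0b01, 1: 0b10}
WEIGHT_OF = {code: weight for weight, code in CODE_OF.items()}


def pack_ternary(weights: list[int]) -> bytes:
    """Pack ternary weights four to a byte, 2 bits each."""
    if len(weights) % 4 != 0:
        raise ValueError("this sketch expects a multiple of 4 weights")
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for slot, w in enumerate(weights[i:i + 4]):
            byte |= CODE_OF[w] << (2 * slot)  # place the code in its 2-bit slot
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed: bytes) -> list[int]:
    """Recover the ternary weights from their packed bytes."""
    weights = []
    for byte in packed:
        for slot in range(4):
            weights.append(WEIGHT_OF[(byte >> (2 * slot)) & 0b11])
    return weights


weights = [1, 0, -1, -1, 0, 0, 1, -1]
packed = pack_ternary(weights)

print(len(weights), "weights stored in", len(packed), "bytes")
print(unpack_ternary(packed) == weights)  # True
```

The shifting and masking in unpack_ternary is the extra work a general-purpose processor has to do in software because it has no native ternary format.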

The real shift will happen when hardware factories start building custom chips optimized around simple accumulators instead of massive banks of multipliers. It will completely alter the physical limits, temperatures, and costs of running AI.
