I asked ChatGPT this question, i.e. whether it would be much faster than a 4090, and it told me it is actually slower than the 4090. The answer is as follows:
Short answer up front:
If you are strictly comparing 4‑bit (FP4 or INT4) throughput, a modern RTX 4090 can already exceed 1,000 TOPs at 4‑bit precision using its Tensor Cores (on Ada the native 4‑bit format is INT4, not FP4, but the 4‑bit peak is what matters for this comparison). In other words, a “1,000 TFLOP at FP4” system is not necessarily faster than an RTX 4090; it may actually be a bit slower.
---
Why the confusion?
Most published “TFLOP” numbers you see for GPUs (including the RTX 4090) refer to FP32 (32‑bit float) throughput, which is around 82 TFLOPs on a 4090. However, GPUs also have specialized Tensor Cores that can process much lower-precision data (FP16, INT8, INT4, and on newer architectures FP8 / FP4) at significantly higher rates.
Because NVIDIA’s official marketing often emphasizes FP32 or “Tensor TF32,” people rarely see a clear “FP4 TFLOPs” figure. But under the hood, the hardware can go much faster at lower precision.
---
Rough numbers for the RTX 4090 (Ada Lovelace)
Exact 4‑bit throughput figures for the 4090 are not prominently published on product pages. However, we can estimate them from the typical “doubling” step for each halving of precision on Tensor Cores (a short sketch of this arithmetic follows the list). For example:
1. FP32 (standard CUDA cores): ~82 TFLOPs peak.
2. FP16 (Tensor Cores): commonly 2× or 4× the FP32 rate, so ~165–330 TFLOPs.
3. INT8 (Tensor Cores): another ~2× over FP16, so up to ~660 TOPs.
4. INT4 (Tensor Cores): another ~2× again, potentially up to ~1,300 TOPs (1.3 petaops), the figure often quoted loosely as “~1,300 TFLOPs.”
Real‑world throughput depends on clock speeds, exact kernel usage, memory bandwidth, etc. But in many AI or inference scenarios (where INT4/FP4 is common), the 4090 can push well into the 1,000+ TFLOP range.
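To make that ladder explicit, here is a minimal back-of-the-envelope sketch in Python. It simply applies the doubling heuristic to the published FP32 figure; the multipliers (including taking the upper 4× end of the FP16 range) are the rough estimates from the list above, not official NVIDIA specifications.

```python
# Back-of-the-envelope estimate of RTX 4090 low-precision peaks,
# derived from the published FP32 figure via the rough
# "2x per halving of precision" heuristic. Illustrative only, not specs.

FP32_TFLOPS = 82.0  # published FP32 (CUDA core) peak for the RTX 4090

# Assumed multipliers over the FP32 baseline for Tensor Core formats
# (4x is the upper end of the 2x-4x range quoted for FP16 above).
precision_multipliers = {
    "FP16 (Tensor Cores)": 4,   # ~330 TFLOPs
    "INT8 (Tensor Cores)": 8,   # ~660 TOPs
    "INT4 (Tensor Cores)": 16,  # ~1,300 TOPs
}

for fmt, factor in precision_multipliers.items():
    print(f"{fmt}: ~{FP32_TFLOPS * factor:,.0f} T(FL)OPs estimated")
```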
---
So how many times faster is 1,000 TFLOPs FP4 vs. a 4090?
If you compare “1,000 TFLOPs at 4‑bit” to the 4090’s usual 82 TFLOPs at 32‑bit, you’d misleadingly conclude “over 12× faster.” But that’s mixing apples and oranges (FP4 vs. FP32).
If you compare 4‑bit to 4‑bit (apples to apples on Tensor Cores), the 4090 itself can exceed 1,000 TFLOPs. In that sense, a 1,000 TFLOP FP4 device might actually be on par with or slightly slower than a 4090’s maximum 4‑bit rate.
In other words:
> 4090 at 4‑bit (INT4) ≈ 1,300 TFLOPs (theoretical)
> New system = 1,000 TFLOPs at FP4
Hence, it would be around 0.77× the speed of the 4090 at that same precision (not faster).
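The same two comparisons, worked out in a short Python sketch; the 1,300 figure is the theoretical estimate from above, not a measured number:

```python
# Two ways of reading "1,000 TFLOPs at FP4" against the RTX 4090.
new_system_4bit = 1_000  # advertised 4-bit throughput (TFLOPs)
rtx4090_fp32 = 82        # published FP32 peak (TFLOPs)
rtx4090_4bit = 1_300     # estimated 4-bit Tensor Core peak (theoretical)

# Apples to oranges: 4-bit spec vs. the 4090's 32-bit figure
print(f"vs. 4090 FP32:  {new_system_4bit / rtx4090_fp32:.1f}x")  # ~12.2x (misleading)

# Apples to apples: 4-bit vs. 4-bit
print(f"vs. 4090 4-bit: {new_system_4bit / rtx4090_4bit:.2f}x")  # ~0.77x (slightly slower)
```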
---
Bottom line
- Always compare the same data types (FP32 vs. FP32, FP16 vs. FP16, or 4‑bit vs. 4‑bit).
- NVIDIA’s Ada Lovelace Tensor Cores already achieve extremely high throughput at low precision, into the 1+ petaops (1,000+ TOPs) range for INT4.
- A “1,000 TFLOPs FP4” spec does not automatically mean it is faster than a 4090; in fact, it’s likely close to or below the 4090’s peak 4‑bit performance.