I asked ChatGPT this question, i.e. whether it would be much faster than a 4090, and it told me it is actually slower than the 4090. The answer is as follows:
Short answer up front:
If you are strictly comparing 4‑bit (FP4 or INT4) throughput, a modern RTX 4090 can already exceed 1,000 TOPs at 4‑bit precision using its Tensor Cores (on Ada the native 4‑bit format is INT4, not FP4, but the 4‑bit peak is what matters for this comparison). In other words, a “1,000 TFLOP at FP4” system is not necessarily faster than an RTX 4090; it may actually be a bit slower.
---
Why the confusion?
Most published “TFLOP” numbers you see for GPUs (including the RTX 4090) refer to FP32 (32‑bit float) throughput, which is around 82 TFLOPs on a 4090. However, GPUs also have specialized Tensor Cores that can process much lower-precision data (FP16, INT8, INT4, and on newer architectures FP8 / FP4) at significantly higher rates.
Because NVIDIA’s official marketing often emphasizes FP32 or “Tensor TF32,” people rarely see a clear “FP4 TFLOPs” figure. But under the hood, the hardware can go much faster at lower precision.
---
Rough numbers for the RTX 4090 (Ada Lovelace)
Exact 4‑bit throughput figures for the 4090 are not prominently published on product pages. However, we can estimate them from the typical “doubling” step for each halving of precision on Tensor Cores (a short sketch of this arithmetic follows the list). For example:
1. FP32 (standard CUDA cores): ~82 TFLOPs peak.
2. FP16 (Tensor Cores): commonly 2× or 4× the FP32 rate, so ~165–330 TFLOPs.
3. INT8 (Tensor Cores): another ~2× over FP16, so up to ~660 TOPs.
4. INT4 (Tensor Cores): another ~2× again, potentially up to ~1,300 TOPs (1.3 petaops), the figure often quoted loosely as “~1,300 TFLOPs.”
Real‑world throughput depends on clock speeds, exact kernel usage, memory bandwidth, etc. But in many AI or inference scenarios (where INT4/FP4 is common), the 4090 can push well into the 1,000+ TFLOP range.
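To make that ladder explicit, here is a minimal back-of-the-envelope sketch in Python. It simply applies the doubling heuristic to the published FP32 figure; the multipliers (including taking the upper 4× end of the FP16 range) are the rough estimates from the list above, not official NVIDIA specifications.

```python
# Back-of-the-envelope estimate of RTX 4090 low-precision peaks,
# derived from the published FP32 figure via the rough
# "2x per halving of precision" heuristic. Illustrative only, not specs.

FP32_TFLOPS = 82.0  # published FP32 (CUDA core) peak for the RTX 4090

# Assumed multipliers over the FP32 baseline for Tensor Core formats
# (4x is the upper end of the 2x-4x range quoted for FP16 above).
precision_multipliers = {
    "FP16 (Tensor Cores)": 4,   # ~330 TFLOPs
    "INT8 (Tensor Cores)": 8,   # ~660 TOPs
    "INT4 (Tensor Cores)": 16,  # ~1,300 TOPs
}

for fmt, factor in precision_multipliers.items():
    print(f"{fmt}: ~{FP32_TFLOPS * factor:,.0f} T(FL)OPs estimated")
```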
---
So how many times faster is 1,000 TFLOPs FP4 vs. a 4090?
If you compare “1,000 TFLOPs at 4‑bit” to the 4090’s usual 82 TFLOPs at 32‑bit, you’d misleadingly conclude “over 12× faster.” But that’s mixing apples and oranges (FP4 vs. FP32).
If you compare 4‑bit to 4‑bit (apples to apples on Tensor Cores), the 4090 itself can exceed 1,000 TFLOPs. In that sense, a 1,000 TFLOP FP4 device might actually be on par with or slightly slower than a 4090’s maximum 4‑bit rate.
In other words:
> 4090 at 4‑bit (INT4) ≈ 1,300 TFLOPs (theoretical)
> New system = 1,000 TFLOPs at FP4
Hence, it would be around 0.77× the speed of the 4090 at that same precision (not faster).
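The same two comparisons, worked out in a short Python sketch; the 1,300 figure is the theoretical estimate from above, not a measured number:

```python
# Two ways of reading "1,000 TFLOPs at FP4" against the RTX 4090.
new_system_4bit = 1_000  # advertised 4-bit throughput (TFLOPs)
rtx4090_fp32 = 82        # published FP32 peak (TFLOPs)
rtx4090_4bit = 1_300     # estimated 4-bit Tensor Core peak (theoretical)

# Apples to oranges: 4-bit spec vs. the 4090's 32-bit figure
print(f"vs. 4090 FP32:  {new_system_4bit / rtx4090_fp32:.1f}x")  # ~12.2x (misleading)

# Apples to apples: 4-bit vs. 4-bit
print(f"vs. 4090 4-bit: {new_system_4bit / rtx4090_4bit:.2f}x")  # ~0.77x (slightly slower)
```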
---
Bottom line
- Always compare the same data types (FP32 vs. FP32, FP16 vs. FP16, or 4‑bit vs. 4‑bit).
- NVIDIA’s Ada Lovelace Tensor Cores already achieve extremely high throughput at low precision, into the 1+ petaops (1,000+ TOPs) range for INT4.
- A “1,000 TFLOPs FP4” spec does not automatically mean it is faster than a 4090; in fact, it’s likely close to or below the 4090’s peak 4‑bit performance.