https://developer.nvidia.com/blog/accelerating-matrix-multiplication-with-block-sparse-format-and-nvidia-tensor-cores/
cuSPARSE should have supported the 4090 from day one; Tensor Core support has been there since the 20 series (Turing).
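The linked post is about block-sparse (Blocked-ELL) matmul on Tensor Cores. A minimal CPU-side sketch of the idea, assuming a toy dict-of-tiles layout rather than cuSPARSE's actual format: store only the nonzero dense blocks, feed FP16 inputs, and accumulate in FP32 the way Tensor Cores do.

```python
import numpy as np

# Toy block-sparse mat-vec: keep only nonzero (block_row, block_col) tiles.
# This layout is an illustration, not cuSPARSE's Blocked-ELL storage.
BS = 4
blocks = {  # (block_row, block_col) -> dense BSxBS tile
    (0, 0): np.full((BS, BS), 1.0, dtype=np.float16),
    (1, 1): np.full((BS, BS), 2.0, dtype=np.float16),
}
x = np.ones(2 * BS, dtype=np.float16)
y = np.zeros(2 * BS, dtype=np.float32)  # FP32 accumulator, like Tensor Cores

for (br, bc), tile in blocks.items():
    y[br*BS:(br+1)*BS] += (
        tile.astype(np.float32) @ x[bc*BS:(bc+1)*BS].astype(np.float32)
    )

print(y)  # [4. 4. 4. 4. 8. 8. 8. 8.]
```

Only the two stored tiles ever touch the arithmetic units; the zero blocks cost nothing, which is where the speedup in the blog post comes from.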
If you want FP8 you basically need the Hopper architecture (strictly, Ada's 4th-gen Tensor Cores can also do FP8, but library support landed on Hopper first). The 4090 is Ada Lovelace, and its clocks are much higher than Ampere's, so if you have enough VRAM just go ahead with FP16. If VRAM isn't enough, that means the 4090 isn't enough for your workload and you need to move up to an A6000/H100.
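A quick sketch of why FP16 is attractive when VRAM allows it, and what you give up. The array sizes here are arbitrary examples:

```python
import numpy as np

# FP16 halves the memory (and VRAM / bandwidth) footprint of FP32.
w32 = np.ones((1024, 1024), dtype=np.float32)
w16 = w32.astype(np.float16)
print(w32.nbytes // 2**20, "MB vs", w16.nbytes // 2**20, "MB")  # 4 MB vs 2 MB

# The trade-off: FP16 has only an 11-bit significand, so integers
# above 2048 are no longer exactly representable.
print(np.float16(2049))  # rounds to 2048.0
```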
Also, whether to run FP64 isn't about how big your project is; it's about whether your precision requirement fits within the type's range. I routinely use everything from Pascal up to A100-class cards, depending on what resources the cluster allocates for the job. For portability I stick to FP32, because the workstation you deploy to will most likely have a GeForce card. Training in FP64 and then trimming back down to FP32 at the end is a waste of compute.
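"Does my precision requirement fit in the type" is something you can check directly. A sketch with `np.spacing`, using 1e5 as an example magnitude: at that scale FP32's resolution is about 0.008, so if you needed, say, 1e-3 absolute precision there, FP32 would already be too coarse and FP64 justified.

```python
import numpy as np

# Gap to the next representable value at magnitude 1e5:
print(np.spacing(np.float32(1e5)))  # ~0.0078  -> can't resolve 1e-3 here
print(np.spacing(np.float64(1e5)))  # ~1.5e-11 -> plenty of headroom
```

The point is to run this check against your own data's magnitude and required resolution before paying the FP64 bandwidth and throughput cost.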
I helped a colleague with an optical simulation: the precision requirement allowed us to speed it up by computing in int16, then casting back to float in post-processing to get the final output. Only a few extra lines of code needed changing across the whole codebase.
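A minimal sketch of that kind of pipeline. The scale factor and the reduction are hypothetical stand-ins for the actual simulation: quantize into int16, do the heavy work in integer (accumulating in a wider type to avoid overflow), and cast back to float only for the final output.

```python
import numpy as np

SCALE = 1024  # hypothetical fixed-point scale: values in [0, 1) -> int16
rng = np.random.default_rng(0)
field = rng.uniform(0.0, 1.0, 10_000)

q = np.round(field * SCALE).astype(np.int16)  # quantize once, up front
acc = q.astype(np.int32).sum()                # int32 accumulator: 10k * 1024 fits
result = acc / (SCALE * len(field))           # cast back only at the end

print(abs(result - field.mean()))  # quantization error well under 1e-3
```

As in the original code, the change is localized: the quantize step at the top and the cast at the bottom; everything in between stays integer.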
Bumping up the precision also inflates your PCIe bandwidth and RAM requirements a lot, and the time spent transferring data on and off the cluster is a cost too. Chasing precision is best left to NVIDIA's engineers. In experimental science you rarely have a real need for more than 16 bits; it's only once you get to million-scale cell tracking that integers stop fitting and some columns of the sparse table have to go up to 32-bit.
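A sketch of that last point, assuming a hypothetical cell-tracking table with an integer ID column: int16 tops out at 32767, so at million scale the IDs silently wrap, and the fix (int32) doubles that column's memory and transfer cost.

```python
import numpy as np

n_cells = 2_000_000
ids16 = np.arange(n_cells).astype(np.int16)      # wraps past 32767
ids32 = np.arange(n_cells, dtype=np.int32)       # correct, but 2x the bytes

print(ids16[40_000])                  # -25536: wraparound garbage
print(ids32.nbytes / ids16.nbytes)    # 2.0: the bandwidth/RAM cost of widening
```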