Kimi K2 - 1T MoE, 32B active params

大棍巴

8 replies
大棍巴 2025-07-12 19:25:39
大棍巴 2025-07-12 19:26:07
Benchmark                        Metric                      Kimi K2   DeepSeekV3   Qwen3-235B   Claude S4   Claude Opus   GPT-4.1    Gemini 2.5
                                                           Instruct    (0324)       -A22B       (no think)   (no think)               Flash (0520)
------------------------------  --------------------------  --------  ----------  -----------  ----------  ------------  ---------  -------------
Coding Tasks
LiveCodeBench v6 (Aug24-May25)  Pass@1                        53.7       46.9         37.0         48.5         47.4         44.7         44.7
OJBench                         Pass@1                        27.1       24.0         11.3         15.3         19.6         19.5         19.5
MultiPL-E                       Pass@1                        85.7       83.1         78.2         88.6         89.6         86.7         85.6
SWE-bench Verified
(Agentless Coding)              Acc (no test)                 51.8       36.6         39.4         50.2         53.0         40.8         32.6
SWE-bench Verified
(Agentic Coding)                Single Attempt                65.8       38.8         34.4         72.7*        72.5*        54.6           —
                                Multiple Attempts             71.6         —            —          80.2         79.4*          —            —
SWE-bench Multilingual          Single Attempt                47.3       25.8         20.9         51.0           —          31.5           —
TerminalBench                   Inhouse Framework             30.0         —            —          35.5         43.2          8.3           —
                                Terminus                      25.0       16.3          6.6           —            —          30.3         16.8
Aider-Polyglot                  Acc                           60.0       55.1         61.8         56.4         70.7         52.4         44.0

Tool Use Tasks
Tau2 retail                     Avg@4                         70.6       69.1         57.0         75.0         81.8         74.8         64.3
Tau2 airline                    Avg@4                         56.5       39.0         26.5         55.5         60.0         54.5         42.5
Tau2 telecom                    Avg@4                         65.8       32.5         22.1         45.2         57.0         38.6         16.9
AceBench                        Acc                           76.5       72.7         70.5         76.2         75.6         80.1         74.5
大棍巴 2025-07-12 19:27:05
Benchmark                        Metric                      Kimi K2   DeepSeekV3   Qwen3-235B   Claude S4   Claude Opus   GPT-4.1    Gemini 2.5
                                                           Instruct    (0324)       -A22B       (no think)   (no think)               Flash (0520)
------------------------------  --------------------------  --------  ----------  -----------  ----------  ------------  ---------  -------------
Math & STEM Tasks
AIME 2024                       Avg@64                        69.6       59.4*        40.1*        43.4         48.2         46.5         61.3
AIME 2025                       Avg@64                        49.5       46.7         24.7*        33.1*        33.9*        37.0         46.6
MATH-500                        Acc                           97.4       94.0*        91.2*        94.0         94.4         92.4         95.4
HMMT 2025                       Avg@32                        38.8       27.5         11.9         15.9         15.9         19.4         34.7
CNMO 2024                       Avg@16                        74.3       74.7         48.6         60.4         57.6         56.6         75.0
PolyMath-en                     Avg@4                         65.1       59.5         51.9         52.8         49.8         54.0         49.9
ZebraLogic                      Acc                           89.0       84.0         37.7*        73.7         59.3         58.5         57.9
AutoLogi                        Acc                           89.5       88.9         83.3         89.8         86.1         88.2         84.1
GPQA-Diamond                    Avg@8                         75.1       68.4*        62.9*        70.0*        74.9*        66.3         68.2
SuperGPQA                       Acc                           57.2       53.7         50.2         55.7         56.5         50.8         49.6
Humanity's Last Exam (Text)     Score                          4.7        5.2          5.7          5.8          7.1          3.7          5.6

General Tasks
MMLU                            EM                            89.5       89.4         87.0         91.5         92.9         90.4         90.1
MMLU-Redux                      EM                            92.7       90.5         89.2         93.6         94.2         92.4         90.6
MMLU-Pro                        EM                            81.1       81.2*        77.3         83.7         86.6         81.8         79.4
IFEval                          Prompt Strict                 89.8       81.1         83.2*        87.6         87.4         88.0         84.3
Multi-Challenge                 Acc                           54.1       31.4         34.0         46.8         49.0         36.4         39.5
SimpleQA                        Correct                       31.0       27.7         13.2         15.9         22.8         42.3         23.3
Livebench                       Pass@1                        76.4       72.4         67.6         74.8         74.6         69.8         67.8
天才小釣手 2025-07-12 20:26:26
Fucking huge
大棍巴 2025-07-12 20:37:21
But since it's an MoE with only ~32B active params, you can actually run it on CPU + host memory and still get 10-plus tps.
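A quick back-of-envelope sketch in Python of where a 10-plus tps figure can come from, assuming CPU decoding is memory-bandwidth-bound; the bandwidth and quantization numbers are illustrative assumptions, not measurements of Kimi K2 itself:

    # Decode-speed estimate for an MoE running from CPU + host memory.
    # Assumption: generation is memory-bandwidth-bound, so tokens/s is roughly
    # (usable memory bandwidth) / (bytes of active weights read per token).
    ACTIVE_PARAMS = 32e9         # ~32B active params per token (thread title)
    BYTES_PER_PARAM = 0.5        # assume ~4-bit quantized weights
    MEM_BANDWIDTH_GBS = 300.0    # assume a multi-channel DDR5 server, ~300 GB/s usable

    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM        # ~16 GB touched per token
    ceiling_tps = MEM_BANDWIDTH_GBS * 1e9 / bytes_per_token  # bandwidth-limited ceiling

    print(f"~{bytes_per_token / 1e9:.0f} GB of weights read per token")
    print(f"bandwidth-limited ceiling: ~{ceiling_tps:.0f} tokens/s")
    # Real throughput lands below this ceiling (KV cache, activations, imperfect
    # bandwidth use), which is consistent with the 10-plus tps quoted above.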
薩爾達 2025-07-13 13:42:34
https://www.kimi.com/
Went and had a look, it's just so-so
No image generation
Pretty much the same as DEEPSHIT
Except it has a few more preset-prompt bots
debugger; 2025-07-13 15:26:51
sota
debugger; 2025-07-13 15:27:21
1tb ram
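For context on the 1 TB figure, a rough Python sketch of the weight footprint of a 1T-parameter checkpoint at a few precisions; the bytes-per-parameter values are generic assumptions, not a statement about the official release format:

    # Weight footprint of a 1T-parameter model at common precisions.
    TOTAL_PARAMS = 1e12  # 1T total params (MoE: only ~32B active per token)

    for label, bytes_per_param in [("fp16", 2.0), ("fp8/int8", 1.0), ("~4-bit", 0.5)]:
        gb = TOTAL_PARAMS * bytes_per_param / 1e9
        print(f"{label}: ~{gb:,.0f} GB of weights")
    # -> fp16 ~2,000 GB, fp8/int8 ~1,000 GB (the "1 TB" ballpark), ~4-bit ~500 GB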