Bar Chart · #paired-bar #category-colours #reference-line #many-models #tilted-labels

PHYBench · 19-model paired bars (Accuracy + EED Score) with category colours and human reference lines

Reproduction of PHYBench Figure 1. For each of 19 LLMs, the chart places an Accuracy bar (darker) next to an EED Score bar (lighter). Bars use three category colour families: Reasoning Models (blue), General Models (maroon/pink), and 32B Models (brown). Two red dashed horizontal lines mark the human-expert baselines: Accuracy (61.9) and EED Score (70.4). Every bar carries a numeric label above it, and the x-axis labels are tilted 45°.
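The side-by-side layout described above comes down to shifting each bar half a bar width away from its model's slot centre. The following is a minimal sketch of that arithmetic only, not the plotting code. The bar width of 0.4 matches the script on this page; the names `centres`, `acc_x`, `eed_x` and `gap` are illustrative.

```python
import numpy as np

n_models = 19   # number of models in the figure
W = 0.4         # width of each bar in x-axis units (one unit per model slot)

centres = np.arange(n_models)   # slot centre for each model
acc_x = centres - W / 2         # centre of the Accuracy bar
eed_x = centres + W / 2         # centre of the EED Score bar

# Right edge of one pair and left edge of the next pair.
pair_right = eed_x[:-1] + W / 2
next_left = acc_x[1:] - W / 2
gap = next_left - pair_right    # free space between neighbouring pairs

print(acc_x[:3], eed_x[:3], gap.min())
```

With two bars of width `W` per unit slot, the space left between neighbouring pairs is `1 - 2 * W`, so a larger `W` packs the pairs tighter and `W = 0.5` would make them touch.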

@paper · from the paper

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models


Shi Qiu et al. (Peking University) · arXiv 2025

arXiv:2504.16074

// original from paper

// reproduced via phybench_model_perf.py

phybench_model_perf.py


"""PHYBench · Per-model paired bars with category-coded colours and human reference lines.

Reproduction of PHYBench Figure 1 (Model performance on PHYBench).
Source: PHYBench: Holistic Evaluation of Physical Perception and Reasoning
in Large Language Models, arXiv:2504.16074.

Each model is shown as a pair of side-by-side bars (Accuracy = darker,
EED Score = lighter). Models are coloured by category:
  - Reasoning Models : blue / light blue
  - General Models   : maroon / pink
  - 32B Models       : brown / tan
Two red dashed horizontal lines mark the human-expert baselines.
"""

import matplotlib.pyplot as plt
import numpy as np

plt.rcParams.update({
    "font.family": "sans-serif",
    "font.sans-serif": ["DejaVu Sans", "Arial"],
})

REASONING_DARK = "#1F6EB1"
REASONING_LIGHT = "#9DC4E5"
GENERAL_DARK = "#A52A6A"
GENERAL_LIGHT = "#F0BAD0"
LRM32_DARK = "#9C5A2B"
LRM32_LIGHT = "#E8D2B6"
HUMAN_LINE = "#C4191C"

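# One row per model: (display name, Accuracy, EED Score, category key).
# Rows are plotted left to right in the order listed here.
MODELS = [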
    ("Gemini 2.5 pro",                36.9, 49.5, "reasoning"),
    ("o3 (high)",                     34.8, 46.4, "reasoning"),
    ("o4-mini (high)",                29.4, 41.9, "reasoning"),
    ("DeepSeek-R1",                   25.0, 37.9, "reasoning"),
    ("o3-mini (high)",                25.0, 37.3, "reasoning"),
    ("o4-mini",                       24.9, 36.4, "reasoning"),
    ("o3-mini",                       21.3, 33.3, "reasoning"),
    ("Grok 3 Beta",                   21.2, 32.0, "reasoning"),
    ("Gemini 2.0 Flash Thinking",     18.2, 30.3, "reasoning"),
    ("Claude 3.7 Sonnet Thinking",    18.0, 27.4, "reasoning"),
    ("o1",                            15.3, 27.1, "reasoning"),
    ("o3-mini (low)",                 13.7, 25.3, "reasoning"),
    ("DeepSeek-V3",                   13.6, 24.2, "general"),
    ("Claude 3.7 Sonnet",             12.9, 23.8, "general"),
    ("GPT-4.1",                       13.2, 23.7, "general"),
    ("GPT-4o",                         7.0, 15.4, "general"),
    ("Qwen2.5-max",                    6.1, 13.9, "general"),
    ("QwQ-32B",                        2.6,  4.5, "lrm32"),
    ("DeepSeek-R1-Distill-Qwen-32B",   1.2,  3.2, "lrm32"),
]

# Category key -> (Accuracy colour, EED Score colour).
CAT_COLORS = {
    "reasoning": (REASONING_DARK, REASONING_LIGHT),
    "general":   (GENERAL_DARK, GENERAL_LIGHT),
    "lrm32":     (LRM32_DARK, LRM32_LIGHT),
}

names = [m[0] for m in MODELS]
acc = np.array([m[1] for m in MODELS])
eed = np.array([m[2] for m in MODELS])
cats = [m[3] for m in MODELS]
n = len(names)

fig, ax = plt.subplots(figsize=(13, 5.4))

x = np.arange(n)  # one unit-wide slot per model
W = 0.4           # bar width; each pair spans 2 * W = 0.8 of its slot

# Draw each pair (Accuracy left, EED Score right) and label both bars with their values.
for i, c in enumerate(cats):
    dark, light = CAT_COLORS[c]
    ax.bar(x[i] - W / 2, acc[i], width=W, color=dark,
           edgecolor=dark, linewidth=0.4, zorder=3)
    ax.bar(x[i] + W / 2, eed[i], width=W, color=light,
           edgecolor=light, linewidth=0.4, zorder=3)

    ax.text(x[i] - W / 2, acc[i] + 0.6, f"{acc[i]:.1f}",
            ha="center", va="bottom", fontsize=8, color="black")
    ax.text(x[i] + W / 2, eed[i] + 0.6, f"{eed[i]:.1f}",
            ha="center", va="bottom", fontsize=8, color="black")

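# Human-expert baselines: EED Score (70.4) and Accuracy (61.9).
# zorder=2 keeps the lines behind the bars (zorder=3).
ax.axhline(70.4, color=HUMAN_LINE, lw=1.0, ls=(0, (5, 3)), zorder=2)
ax.axhline(61.9, color=HUMAN_LINE, lw=1.0, ls=(0, (5, 3)), zorder=2)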

ax.text(0.05, 70.4 + 0.6, "Human Experts (EED Score): 70.4",
        color=HUMAN_LINE, fontsize=9, fontweight="normal", va="bottom")
ax.text(0.05, 61.9 + 0.6, "Human Experts (Accuracy): 61.9",
        color=HUMAN_LINE, fontsize=9, fontweight="normal", va="bottom")

ax.set_ylim(0, 76)
ax.yaxis.set_major_locator(plt.MultipleLocator(10))
ax.set_ylabel("Score", fontsize=10)

ax.set_xticks(x)
ax.set_xticklabels(names, rotation=45, ha="right", fontsize=9)

ax.set_xlim(-0.7, n - 0.3)
ax.grid(axis="y", ls=":", lw=0.5, color="#999", zorder=0)
ax.set_axisbelow(True)
for sp in ("top", "right"):
    ax.spines[sp].set_visible(False)
for sp in ("left", "bottom"):
    ax.spines[sp].set_color("#444")

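# Proxy rectangles for the legend: the bars are drawn one at a time without
# labels, so the six legend entries are built by hand.
handles = [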
    plt.Rectangle((0, 0), 1, 1, color=REASONING_DARK,  label="Reasoning Models (Accuracy)"),
    plt.Rectangle((0, 0), 1, 1, color=REASONING_LIGHT, label="Reasoning Models (EED Score)"),
    plt.Rectangle((0, 0), 1, 1, color=GENERAL_DARK,    label="General Models (Accuracy)"),
    plt.Rectangle((0, 0), 1, 1, color=GENERAL_LIGHT,   label="General Models (EED Score)"),
    plt.Rectangle((0, 0), 1, 1, color=LRM32_DARK,      label="32B Models (Accuracy)"),
    plt.Rectangle((0, 0), 1, 1, color=LRM32_LIGHT,     label="32B Models (EED Score)"),
]
leg = ax.legend(handles=handles, loc="upper right", title="Model Categories",
                fontsize=9, title_fontsize=9.5, framealpha=1.0,
                edgecolor="#888")
leg.get_frame().set_linewidth(0.6)

plt.savefig("phybench_model_perf.png", dpi=300, bbox_inches="tight",
            facecolor="white")
plt.close()
print("saved: phybench_model_perf.png")
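The script above draws and labels each bar inside a Python loop. As an alternative sketch, not the paper's or the script's method, the same pairs can be drawn with one `ax.bar` call per metric by passing a list of per-bar colours, and the value labels can be produced with `Axes.bar_label`, which assumes Matplotlib 3.4 or newer. To stay short, the example uses only three rows copied from `MODELS`, one per category. The names `rows`, `colors`, `acc_bars` and `eed_bars` are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Three rows copied from MODELS above, one per category.
rows = [
    ("Gemini 2.5 pro", 36.9, 49.5, "reasoning"),
    ("DeepSeek-V3",    13.6, 24.2, "general"),
    ("QwQ-32B",         2.6,  4.5, "lrm32"),
]
# Same hex values as the palette constants in the script.
colors = {
    "reasoning": ("#1F6EB1", "#9DC4E5"),
    "general":   ("#A52A6A", "#F0BAD0"),
    "lrm32":     ("#9C5A2B", "#E8D2B6"),
}

acc = np.array([r[1] for r in rows])
eed = np.array([r[2] for r in rows])
dark = [colors[r[3]][0] for r in rows]
light = [colors[r[3]][1] for r in rows]

x = np.arange(len(rows))
W = 0.4

fig, ax = plt.subplots(figsize=(5, 3))
# One call per metric: `color` accepts a list with one entry per bar.
acc_bars = ax.bar(x - W / 2, acc, width=W, color=dark, zorder=3)
eed_bars = ax.bar(x + W / 2, eed, width=W, color=light, zorder=3)

# bar_label creates one text per bar; `padding` is measured in points.
acc_texts = ax.bar_label(acc_bars, fmt="%.1f", padding=2, fontsize=8)
eed_texts = ax.bar_label(eed_bars, fmt="%.1f", padding=2, fontsize=8)

ax.set_xticks(x)
ax.set_xticklabels([r[0] for r in rows], rotation=45, ha="right")
plt.close(fig)
```

The loop in the original script gives explicit control over each label's position in data units (`+ 0.6`), whereas `bar_label` offsets labels in points, so the spacing stays constant if the y-axis range changes.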