Independent Evaluation of AI Models: Building a New, Trustworthy Ecosystem for AI Oversight

This article proposes independent, decentralized, reproducible, and context-aware AI evaluations, aiming to widen the conversation on AI oversight, with particular attention to regions and languages that are often overlooked. The authors recommend treating evaluation as an experiment rather than a leaderboard race, emphasizing pre-specified comparisons, reported uncertainty, and documented configurations so that others can reproduce the work. They also propose repeated decoding runs to reduce stochastic noise and advocate cost-aware rigor, obtaining high-quality results at low cost. The article notes that the current AI evaluation ecosystem is concentrated in a small number of well-resourced institutions and suffers from language bias, misaligned incentives, and weak reproducibility. Decentralized evaluations can better reflect local realities, provide checks and balances, reduce the risks of depending on a few actors, and encourage broader participation, leading to a healthier, more trustworthy AI evaluation ecosystem.

🔬 **An experiment, not a leaderboard race**: Treat AI model evaluation as a scientific experiment rather than a simple ranking contest. That means pre-specifying the comparisons, clearly reporting the uncertainty of the results, and documenting every configuration so others can reproduce the work. This approach builds a deeper understanding of model performance instead of merely chasing high scores.

🔄 **Repeated decoding and cost-awareness**: When decoding is stochastic, use repeated runs to reduce random fluctuation in the results. Two repetitions are usually enough to remove most run-to-run noise; a third mainly tightens the error bars further. Cost-aware rigor matters too: understanding how sensitive your evaluation setup is lets you obtain high-quality results without extra cost or complexity.

🌐 **Decentralization and inclusiveness**: The current AI evaluation ecosystem is highly centralized and carries language and regional biases. Promoting decentralized, transparent, and reproducible evaluation methods brings in more local languages and domains and better reflects real-world usage. This not only gives AI governance a broader perspective but also lowers the steep compute and engineering barriers, encouraging small teams and communities to participate and building a fairer, more inclusive evaluation ecosystem.

Published on October 3, 2025 6:26 PM GMT

TL;DR

Independent evaluations are both possible and valuable. Our goal is to widen the conversation on decentralized, reproducible, context-aware evaluations as public infrastructure for AI oversight, especially in regions and languages that frontier work often overlooks.

Our recommendations (based on what actually worked for us):

1. Treat evaluation like an experiment, not a leaderboard. Pre-decide your comparisons, report uncertainty, and document configs so others can replicate.
2. Use repetitions when decoding stochastically. Two runs already cut most run-to-run noise; a third mainly tightens error bars (see the sketch below).
3. Aim for cost-aware rigor. Understanding how your setup behaves (what changes outcomes, what doesn't) lets you get high-quality results without high costs or extra complexity, which is exactly why the "experiment" mindset matters.
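As a minimal sketch of what the first two points can look like in code, the snippet below scores repeated stochastic runs under one fixed configuration and reports mean accuracy with a standard error over runs. The helper names (`run_accuracy`, `summarize_runs`) and the toy answers are illustrative assumptions, not data from our evaluations.

```python
import statistics

def run_accuracy(model_answers, reference_answers):
    """Fraction of items answered correctly in one stochastic run."""
    correct = sum(m == r for m, r in zip(model_answers, reference_answers))
    return correct / len(reference_answers)

def summarize_runs(per_run_accuracies):
    """Mean accuracy across repeated runs, plus a simple error bar
    (standard error of the mean over runs)."""
    mean = statistics.mean(per_run_accuracies)
    sem = statistics.stdev(per_run_accuracies) / len(per_run_accuracies) ** 0.5
    return mean, sem

# Illustrative only: final answers from three repeated runs of the same model
# under one fixed, documented configuration, scored against the references.
references = ["42", "7", "x=3", "0.5", "12"]
runs = [
    ["42", "7", "x=3", "0.4", "12"],   # run 1
    ["42", "8", "x=2", "0.5", "12"],   # run 2
    ["42", "7", "x=3", "0.5", "11"],   # run 3
]
accuracies = [run_accuracy(r, references) for r in runs]
mean, sem = summarize_runs(accuracies)
print(f"accuracy = {mean:.2f} ± {sem:.2f} (SEM over {len(runs)} runs)")
```

Reporting the mean together with its error bar, rather than a single number, is what makes two different evaluations honestly comparable.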

If more small teams adopt this approach—local languages, transparent methods, light but reliable stats—we’ll get a healthier, more trustworthy evaluation ecosystem.

Why independent evaluations?

Most well-known benchmarks are built and interpreted by a small number of well-resourced actors—big labs, elite universities, and a few private platforms. That concentration helped the field move fast, but it also created blind spots.

What centralization gets wrong (in practice): language bias in what gets measured, misaligned incentives, and weak reproducibility.

We don’t think centralized efforts are “bad.” We think they’re insufficient on their own—especially as models show up in places where failure matters (education, healthcare, legal workflows, public services). At that point, evaluation stops being just a research nicety and becomes a governance question: who gets to define “good enough,” for whom, and under what conditions?

Why decentralized efforts matter (and work):

Our own experience with AI4Math and our follow-up on repetitions convinced us that small teams can produce useful evaluations if they keep things simple and scientific: make problems that actually reflect your context; publish your configs; and treat the exercise like an experiment, not a leaderboard race. That approach lets us give cost-aware guidance without needing huge infrastructure.

This isn’t just for research groups. If you’re a practitioner deciding whether to deploy an LLM in a classroom, a clinic, or a back-office process, you need task-specific, context-specific evidence. A lightweight, experimental evaluation—designed around your real use case—can tell you far more than a global leaderboard ever will. Done right, it becomes standard operating procedure: define the task, test transparently, report uncertainty, and only then make the call.
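To make that standard operating procedure concrete, here is a minimal sketch of what a documented evaluation configuration might look like. The `EvalConfig` class, field names, and example values are hypothetical illustrations under the assumptions above, not our actual setup.

```python
from dataclasses import dataclass

@dataclass
class EvalConfig:
    """Everything someone else would need to rerun the evaluation."""
    task: str              # what is being tested, in plain words
    dataset: str           # which items, which version
    model: str             # model name and version/date
    prompt_template: str   # the exact prompt, verbatim
    temperature: float     # decoding settings that affect stochasticity
    max_tokens: int
    scoring_rule: str      # how an answer counts as correct
    repetitions: int       # how many stochastic runs to average over

# Hypothetical example for a classroom math-helper deployment decision.
config = EvalConfig(
    task="University-level math word problems, final answer only",
    dataset="local-math-eval v1 (105 items)",
    model="some-llm-2025-06",
    prompt_template="Resuelve el problema y da solo la respuesta final:\n{problem}",
    temperature=0.7,
    max_tokens=1024,
    scoring_rule="exact match on the final answer",
    repetitions=2,
)
```

The point is simply that everything someone would need to rerun the evaluation lives in one auditable place, published alongside the results and their uncertainty.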

Bottom line: Independent evaluations are both possible and valuable. They widen the lens, strengthen trust, and help the community converge on methods that travel: transparent setups, reproducible runs, and attention to the contexts where these systems actually live.

What Have We Built?

If centralized, English-first evaluations miss important contexts, the remedy is simple: build local, transparent, reproducible setups that others can copy. That’s what we did, first by creating a Spanish-native dataset, then by defining a clear comparison setup, and finally by running the whole thing as an experiment to learn how many repetitions you actually need.

1. A Spanish-native, university-level math dataset
   We authored a 105-problem set across seven domains (Álgebra, Cálculo, Geometría, Probabilidad, Teoría de Números, Combinatoria, Lógica). Each problem has a unique final answer and a step-by-step human solution. The goal: items that are clear, auditable, and relevant to our context. See more at: https://arxiv.org/abs/2505.18978
2. A clear, minimal evaluation scenario
   We compared models in a small, pre-defined set of configurations while holding prompts, decoding settings, and scoring rules constant. This keeps comparisons legible and easy to rerun.
3. Experimental framing (how many runs do you really need?)
   Instead of a one-off leaderboard, we measured how results move across runs and picked the minimal repetition count that keeps findings reliable. In other words, we evaluated our evaluation and used that evidence to recommend a cost-aware default for repetitions (a sketch of this idea follows the list). See more at: https://arxiv.org/abs/2509.24086
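A rough sketch of that "evaluate the evaluation" idea, assuming you already have accuracies from several repeated runs, is below. The `average_sem_for_k` helper and the numbers are hypothetical; this is not our actual analysis pipeline or AI4Math results.

```python
import itertools
import statistics

def average_sem_for_k(per_run_accuracies, k):
    """Average standard error of the mean over every subset of k runs:
    a rough picture of how much each extra repetition tightens the estimate."""
    sems = [statistics.stdev(subset) / k ** 0.5
            for subset in itertools.combinations(per_run_accuracies, k)]
    return statistics.mean(sems)

# Hypothetical per-run accuracies for one model under a fixed configuration.
runs = [0.61, 0.66, 0.63, 0.62, 0.65]

for k in range(2, len(runs) + 1):
    print(f"{k} repetitions -> average SEM ≈ {average_sem_for_k(runs, k):.4f}")
```

Reading the output top to bottom shows the error bar tightening as repetitions are added, which is the kind of evidence you can use to pick a cost-aware default.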

Our Findings

Here are the practical recommendations we now follow (and suggest to others), written so you can apply them directly if you're doing something similar: treat evaluation as an experiment with pre-decided comparisons and documented configs; use repetitions when decoding stochastically (two runs remove most run-to-run noise, and a third mainly tightens error bars); and aim for cost-aware rigor by learning which parts of your setup actually change outcomes.

How Much is “Affordable”?

So far, we have shown that it's feasible for small teams to run independent LLM evaluations. Whether you should invest in producing such evaluations is a case-by-case decision. That said, the approach of running these "experiments" to assess LLM capabilities generalizes across many fields. We therefore want to share roughly how much we invested, so you can factor this into planning your own evaluations.

For the AI4Math benchmark, we spent approximately $1,750 on dataset creation (hackathon prize pool) and $290 on miscellaneous compute (mostly LLM API credits), bringing the direct out-of-pocket total to about $2,040 USD. This figure excludes other costs such as team salaries and online workspace expenses, which can vary widely by region.

In our case, most team members were employed on an hourly or, at most, half-time basis, and this benchmarking work was not our only responsibility; we also ran AI Safety mentorship programs concurrently. We estimate that a comparable team of ~3 FTEs could replicate our efforts in 3–4 months. This is encouraging for small teams and supports the idea that these evaluations are not exclusive to large or highly specialized research groups.

Conclusion

Our experience shows that independent evaluations don’t need to be massive or resource-intensive to be meaningful. With a clear experimental mindset, simple but transparent setups, and context-relevant tasks, small teams can generate results that genuinely inform both local decisions and the broader conversation about model reliability.

The strength of this approach lies not in building the next universal leaderboard, but in multiplying perspectives: more languages, more domains, more contexts where failure modes matter. Each evaluation becomes a piece of shared infrastructure: open, reproducible, and trustworthy.

As models continue to shape high-stakes environments, the question isn’t just “who performs best on a global benchmark,” but “what works well enough, under what conditions, for whom?” Answering that requires decentralization. Our hope is that others—whether researchers, practitioners, or local communities—take up these recommendations, adapt them to their needs, and keep building toward a healthier, more plural evaluation ecosystem.


