AI model makers love to flex their benchmark scores. But how trustworthy are these numbers? What if the tests themselves are rigged, biased, or just plain meaningless?

OpenAI’s o3 debuted with claims that, having been trained on a publicly available ARC-AGI dataset, the LLM scored a "breakthrough 75.7 percent" on ARC-AGI’s semi-private evaluation dataset with a $10K compute limit. ARC-AGI is a set of puzzle-like inputs that AI models try to solve as a measure of intelligence.

Google’s recently introduced Gemini 2.0 Pro, the web titan claims, scored 79.1 percent on MMLU-Pro – an enhanced version of the original MMLU benchmark for natural language understanding. Meanwhile, Meta’s Llama-3 70B claimed a score of 82 percent on MMLU 5-shot back in April 2024. "5-shot" refers to […]
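As background on that evaluation style: in k-shot benchmarking, the model is shown k solved examples in its prompt before the actual test question, with no fine-tuning. The sketch below assembles such a prompt; the helper name and the placeholder questions are illustrative inventions, not real MMLU items or any lab's actual harness.

```python
# Minimal sketch of assembling a k-shot prompt for an MMLU-style
# multiple-choice benchmark. Questions here are invented placeholders.

def build_k_shot_prompt(examples, test_question, k=5):
    """Prepend k solved examples so the model can infer the task
    format from context alone (in-context learning, no fine-tuning)."""
    parts = []
    for question, answer in examples[:k]:
        parts.append(f"Question: {question}\nAnswer: {answer}")
    # The test item ends with "Answer:" so the model completes it.
    parts.append(f"Question: {test_question}\nAnswer:")
    return "\n\n".join(parts)

examples = [(f"placeholder question {i}", f"choice {i % 4}") for i in range(5)]
prompt = build_k_shot_prompt(examples, "placeholder test question")
print(prompt.count("Question:"))  # 6: five solved examples plus the test item
```

With k=5 this is the "5-shot" setting; k=0 ("zero-shot") would present the test question with no examples at all.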
Original web page at www.theregister.com