Are AI Benchmarks Misleading❓ 🤷‍♂️

pdamera · April 10, 2024, 11:19am

I’ve been trying out different AI models by asking them simple, everyday questions. I want to see how they handle the kind of stuff we all wonder about.

I picked test questions across various domains, with the sports prediction catching my interest the most:

Simple Math Question
Trick Question on Dropping an Egg
Marketing Copy for a Customizable Handwritten Name Ring
Email Marketing for an AI Platform
Meal Plan for a 33-Year-Old Male
Legal Question on Employee Dismissal for Social Media Activity
Prediction for the 2026 Football World Cup Winner

The questions were quite different from each other, and the LLMs' answers varied just as much, which is quite interesting.

Simple Math Question: The best answers recognized that the number of apples you have now is unaffected by past actions. GPT-4-turbo, Gemini-pro, Perplexity, Claude-2.1-200k and Claude-3-opus correctly stated that 4 apples today remain 4 apples regardless of past consumption. However, Claude-2.1-200k and Perplexity overanalyzed the simple scenario.
Trick Question on Dropping an Egg: Gemini-pro gave the best answer by spotting the trick, stating “Concrete floors are very hard to crack.” Claude-2.1-200k cleverly suggested catching the egg, and GPT-4-turbo logically proposed dropping it on a soft surface, but only Gemini-pro focused its answer on the hard floor rather than the egg.
Marketing Copy for a Customizable Handwritten Name Ring: Perplexity stood out by combining personalization and elegance in well-rounded ad text likely to appeal to consumers.
Email Marketing for an AI Platform: Claude-3-opus provided the most effective email copy, concise and engaging, emphasizing benefits and including an exciting call to action to grab attention.
Meal Plan for a 33-Year-Old Male: Claude-3-opus’ response offered the best balance, providing a concise, full-day meal plan with specific suggestions under 1600 calories and a helpful tip on portion sizes.
Legal Question on Employee Dismissal for Social Media Activity: The AIs gave differing legal answers, and I lack the expertise to judge which response was best.
Prediction for the 2026 Football World Cup Winner: Predicting sports outcomes is speculative. I believe the AIs that chose Brazil likely rely on historical performance, while Perplexity and Claude-3-opus, which picked France, may be weighing recent international football success.

In conclusion, developers must evaluate if outputs match claimed capabilities and target applications rather than assume universal strong performance, as each model has distinct strengths and weaknesses. Teams should thoughtfully select and combine the best technologies for their needs. Rigorous testing enables building the most reliable AI tools for every business case.
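
For anyone who wants to repeat this kind of side-by-side test, here is a minimal Python sketch of how it could be organized. It is not what I actually ran: the model names, prompts, and reviewer picks below are placeholders, and each "model" is a stand-in function where a real setup would call that vendor's API.

```python
from collections import Counter
from typing import Callable, Dict

# A "model" here is any function that takes a prompt and returns text.
# Hypothetical stand-in: a real setup would wrap each vendor's API call.
Model = Callable[[str], str]


def run_comparison(
    models: Dict[str, Model], prompts: Dict[str, str]
) -> Dict[str, Dict[str, str]]:
    """Ask every model every prompt; returns {category: {model_name: answer}}."""
    return {
        category: {name: ask(prompt) for name, ask in models.items()}
        for category, prompt in prompts.items()
    }


def tally_wins(picks: Dict[str, str]) -> Counter:
    """Count how many categories each model won, given {category: model_name}."""
    return Counter(picks.values())


# Placeholder models that just echo the prompt (assumption, not real output).
models = {
    "model_a": lambda prompt: f"[model_a] {prompt}",
    "model_b": lambda prompt: f"[model_b] {prompt}",
}

# Placeholder prompts, one per domain being tested.
prompts = {
    "math": "placeholder math question",
    "marketing": "placeholder marketing brief",
}

answers = run_comparison(models, prompts)

# A human reviewer reads the answers and records a winner per category.
# These picks are made up purely to show the tally step.
picks = {"math": "model_a", "marketing": "model_b"}
wins = tally_wins(picks)
```

Keeping the judging step as a separate, human-filled dictionary is deliberate: for categories like the legal question, where no clear winner can be named, that category can simply be left out of the picks.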