In a competitive race to refine AI technology, Google has reportedly begun comparing its Gemini AI to Anthropic’s Claude, according to TechCrunch.
Contractors tasked with evaluating Gemini’s performance noticed references to Claude inside the internal tools used to compare Gemini’s output with that of other AI models.
The evaluation process reportedly involves scoring responses on criteria such as truthfulness, relevance, and verbosity; contractors can spend up to 30 minutes per prompt deciding which model performs better.
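For readers unfamiliar with this kind of side-by-side rating, a minimal sketch of how such a rubric might be expressed in code appears below. It is purely illustrative: the criteria names come from the report, but the 1–5 scale, the equal weighting, and every identifier are assumptions, not details of Google’s internal tooling.

```python
from dataclasses import dataclass

# Criteria reportedly used by raters; the 1-5 scale is an assumption.
CRITERIA = ("truthfulness", "relevance", "verbosity")

@dataclass
class Rating:
    model: str
    scores: dict[str, int]  # criterion -> score on an assumed 1-5 scale

    def total(self) -> int:
        # Equal weighting across criteria is assumed for simplicity.
        return sum(self.scores[c] for c in CRITERIA)

def pick_winner(a: Rating, b: Rating) -> str:
    """Return the higher-scoring model, or 'tie' on equal totals."""
    if a.total() == b.total():
        return "tie"
    return a.model if a.total() > b.total() else b.model

# Example: one rater's scores for a single prompt (invented numbers).
gemini = Rating("gemini", {"truthfulness": 4, "relevance": 5, "verbosity": 3})
claude = Rating("claude", {"truthfulness": 5, "relevance": 4, "verbosity": 4})
print(pick_winner(gemini, claude))  # -> claude (13 vs. 12)
```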
How Does Claude Compare to Gemini?
Contractors observed that Claude appears to enforce safety more rigorously than Gemini. In one instance, Claude refused to respond to a prompt it deemed unsafe, while Gemini’s output reportedly included content flagged as inappropriate.
One contractor remarked, “Claude’s safety settings are the strictest among AI models,” pointing to Anthropic’s safety-first approach. For example:
- Claude’s Restrictions: Refused to engage in unsafe role-playing scenarios.
- Gemini’s Misstep: Delivered outputs containing flagged material, including nudity, that violated safety guidelines.
These differences underscore the varied approaches taken by leading AI developers in ensuring their models operate ethically and safely.
Ethical Concerns Around Using Competitor Models
The contractors’ discovery raises questions about Google’s methods. Anthropic’s terms of service explicitly prohibit using Claude to build or improve competing AI models without approval. When asked, Google did not confirm whether it had permission to use Claude in testing.
Google spokesperson Shira McNamara said, “We compare model outputs for evaluations as part of standard industry practice. However, any suggestion that we’ve used Anthropic models to train Gemini is inaccurate.”
Anthropic, which counts Google as a significant investor, declined to comment on whether it had authorized Google’s use of Claude in these evaluations.
Broader Implications in the AI Space
The revelation comes amid increasing scrutiny over how tech giants develop and refine their AI systems.
Last week, reports surfaced that contractors working on Gemini were being asked to assess the model’s responses on highly sensitive topics outside their expertise, raising concerns about potential misinformation in critical areas like healthcare.
What’s at Stake for AI Safety and Innovation?
This development highlights the growing pressure on companies to outperform competitors in the AI space while balancing ethical considerations.
The use of Claude for evaluation – whether permitted or not – reflects the broader challenge of navigating intellectual property rights and ethical boundaries in AI development.
Industry Practices in AI Testing
- Comparative Testing: A common practice in which companies benchmark their AI’s performance against industry standards or competitors’ models (see the sketch after this list).
- Focus Areas: Truthfulness, relevance, verbosity, and safety remain critical criteria in evaluations.
- Challenges: Balancing innovation with adherence to ethical and legal standards.
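As a rough illustration of how such pairwise comparisons are often aggregated, the snippet below computes a simple win rate from a list of per-prompt verdicts. The verdicts are invented for the example, and real evaluation harnesses are considerably more elaborate; this shows only the general idea of comparative testing.

```python
from collections import Counter

def win_rate(verdicts: list[str], model: str) -> float:
    """Fraction of pairwise comparisons won by `model`, ignoring ties."""
    counts = Counter(verdicts)
    decided = sum(n for label, n in counts.items() if label != "tie")
    return counts[model] / decided if decided else 0.0

# Invented per-prompt verdicts, one entry per compared prompt.
verdicts = ["gemini", "claude", "tie", "claude", "gemini", "claude"]
print(f"claude win rate: {win_rate(verdicts, 'claude'):.0%}")  # -> 60%
```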
As the race to dominate the AI market intensifies, transparency and ethical practices will likely face closer scrutiny. For now, Google and Anthropic remain tight-lipped about the details, leaving the AI community to watch closely for further developments.