
Gemini Contractors Are Scoring AI-Generated Expert Responses Without Required Expertise

When it comes to AI, we often marvel at its brilliance: answering questions, generating images, and even composing music. But behind the curtain of Google’s latest generative AI model, Gemini, lies a complex web of human effort, primarily from contractors tasked with rating AI-generated responses. 

Recent changes in Google’s guidelines for these contractors, however, have sparked concerns about the accuracy and reliability of Gemini’s output, especially on sensitive topics.

What’s the Role of Contractors in AI Development?

AI isn’t perfect out of the box; it is refined by people. Contractors, often referred to as prompt engineers or analysts, evaluate and rate AI responses to improve the model’s performance. For Gemini, these contractors (hired through GlobalLogic, a firm owned by Hitachi) assess responses on factors like truthfulness and clarity.

For example, if the AI generates an answer about climate change, a contractor’s job might include verifying the information, checking its tone, and ensuring it is factually accurate. But what happens when the question is highly technical or outside the contractor’s expertise? Until recently, the answer was simple: they could skip it.
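To make that workflow concrete, here is a minimal, purely hypothetical sketch of what a single rating task might look like as data. The structure and field names (truthfulness, clarity, skipped) are illustrative assumptions, not Google’s or GlobalLogic’s actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RatingTask:
    """Hypothetical record for one AI response a contractor reviews.

    Field names are illustrative assumptions, not a real rating schema.
    """
    prompt: str                         # the question put to the model
    response: str                       # Gemini's generated answer
    truthfulness: Optional[int] = None  # e.g. 1-5, None until rated
    clarity: Optional[int] = None       # e.g. 1-5, None until rated
    skipped: bool = False               # old workflow: out-of-expertise tasks could be skipped
    notes: str = ""                     # free-text comments from the rater

# Example: under the old rules, a contractor without a medical background
# could simply mark a niche medical prompt as skipped.
task = RatingTask(
    prompt="What are the early symptoms of this rare condition?",
    response="...",  # model output elided
    skipped=True,
    notes="Outside my domain; deferring to a rater with medical expertise.",
)
```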

A Change in Policy Raises Eyebrows

Previously, contractors working on Gemini could opt out of evaluating prompts that required specialized knowledge, such as complex engineering problems or niche medical queries. This system ensured that evaluations were left to those with the expertise needed for an accurate assessment.

However, a recent shift in Google’s policy now requires contractors to evaluate all prompts, regardless of their expertise. The new guideline reads:

“You should not skip prompts that require specialized domain knowledge. Rate the parts of the prompt you understand and note your lack of domain expertise.”

The only exceptions are prompts that are incomplete or contain harmful content requiring special consent to review.
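To illustrate how narrow the remaining skip conditions are, here is a hedged sketch contrasting the old and new rules as simple predicate functions, based only on what the reporting describes. The function names and parameters are invented for illustration, not taken from any internal tooling.

```python
def could_skip_old(task_domain: str, rater_domains: set[str],
                   incomplete: bool, needs_consent: bool) -> bool:
    """Old policy (as reported): a rater could skip prompts that fell
    outside their expertise, as well as incomplete or consent-gated ones."""
    return incomplete or needs_consent or task_domain not in rater_domains

def can_skip_new(incomplete: bool, needs_consent: bool) -> bool:
    """New policy (as reported): only incomplete prompts or harmful content
    requiring special consent may be skipped; lack of expertise no longer counts."""
    return incomplete or needs_consent

# A niche medical prompt assigned to a rater with no medical background:
print(could_skip_old("medicine", {"general", "writing"}, False, False))  # True
print(can_skip_new(False, False))  # False -> must rate it and note the expertise gap
```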

Why This Change Matters

The updated policy has drawn criticism from contractors and industry watchers. Forcing individuals without proper expertise to evaluate technical or sensitive AI responses, such as those about rare diseases or advanced engineering, could lead to inaccurate assessments.

One contractor expressed frustration in internal correspondence, saying, “I thought the point of skipping was to increase accuracy by giving it to someone better?”

This concern isn’t trivial. If Gemini is trained on flawed evaluations, the consequences could range from minor inaccuracies to significant misinformation in areas like healthcare or legal advice. Imagine asking an AI about symptoms of a rare condition, only to receive a poorly evaluated and misleading response.


Efficiency vs. Accuracy

The policy change may reflect a push for efficiency, ensuring all prompts are rated quickly. But is this efficiency worth the potential compromise in quality? AI models like Gemini are becoming integral to everyday tools, assisting in everything from education to medical queries. Trust in these systems hinges on their accuracy.

How Could This Impact Users?

For everyday users, this could mean:

  • Inconsistent Quality: Gemini’s responses might vary more widely in accuracy, especially on complex topics.
  • Erosion of Trust: If users repeatedly encounter incorrect information, they may lose faith in AI tools.
  • Potential Harm: Inaccurate advice on sensitive issues like health or finance could lead to real-world consequences.

A Broader Look at AI Training Practices

Google’s situation highlights a larger issue in AI development: the reliance on human evaluators who may not always have the necessary expertise.

Other companies, like OpenAI, face similar challenges. Balancing speed, cost, and quality in AI training is a universal dilemma. However, the stakes are particularly high for tools like Gemini, which aim to handle diverse, real-world queries.

What Can Be Done?

To address these challenges, companies might consider:

  • Specialized Evaluators: Assigning tasks to contractors with relevant backgrounds (a minimal routing sketch follows this list).
  • Transparent Policies: Clearly communicating limitations of AI-generated responses.
  • User Safeguards: Implementing disclaimers for sensitive topics or uncertain outputs.
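One way “specialized evaluators” could work in practice is simple routing: send each prompt to a rater whose declared domains cover it, and escalate when no one qualifies. The sketch below is a hypothetical illustration under that assumption; the rater names, domains, and helper function are all invented.

```python
from typing import Optional

# Hypothetical pool of raters and their declared areas of expertise.
RATERS = {
    "alex": {"software", "general"},
    "sam": {"medicine", "biology"},
    "uma": {"finance", "law"},
}

def route_prompt(domain: str) -> Optional[str]:
    """Return a rater whose expertise covers the prompt's domain, if any."""
    for name, domains in RATERS.items():
        if domain in domains:
            return name
    return None  # no qualified rater -> escalate rather than force a rating

print(route_prompt("medicine"))      # 'sam'
print(route_prompt("astrophysics"))  # None -> escalate to an outside expert
```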

Why Trust Matters More Than Ever

As generative AI becomes increasingly integrated into our lives, trust is paramount. Users must feel confident that the information provided is accurate and reliable, especially when it touches on critical areas like health or finance.

Google’s new guidelines may speed up the training process, but the risk of eroding trust in Gemini’s outputs could outweigh the benefits. For now, the eyes of the tech world—and everyday users—are watching to see how these changes will play out.

Sign Up For Our AI Newsletter

Weekly AI essentials. Brief, bold, brilliant. Always free. Learn how to use AI tools to their maximum potential. 👇
