Can you create AI evaluations? | AI on Chrome

Why is intuition an insufficient way to measure the quality of LLM-based applications?

Because LLMs are probabilistic and quality is often subjective.

Because LLMs are generally too slow to be tested in a standard development environment.

Because LLMs are deterministic, meaning the same input always leads to the same output.

Because modern LLMs have zero error rates, making measurements redundant.

Which of the following is an example of a rule-based evaluation for the ThemeBuilder application?

Deciding if a motto is catchy enough for the target audience.

Verifying that the contrast ratio between the text color and background color is at least 4.5:1.

Evaluating whether a color palette is psychologically appropriate for a high-end dentist.

Checking if the generated motto matches the inspiring tone requested by the user.
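
For reference, a contrast check like the 4.5:1 threshold mentioned in the options above can be computed without a model. The sketch below assumes colors arrive as six-digit hex strings and uses the WCAG 2.x relative-luminance formula; the function names are hypothetical and not taken from ThemeBuilder.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    s = channel / 255
    return s / 12.92 if s <= 0.04045 else ((s + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#rrggbb' or 'rrggbb'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio between two colors, always >= 1."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_contrast(foreground: str, background: str, minimum: float = 4.5) -> bool:
    """Rule-based check: True when the pair meets the minimum ratio."""
    return contrast_ratio(foreground, background) >= minimum
```

A check like this returns the same verdict on every run, so it can sit in an ordinary unit test next to any model-based evaluations.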

What is the primary purpose of using pairwise evaluation instead of pointwise evaluation?

To reduce the cost of API calls by testing two inputs at once.

To evaluate binary constraints like JSON formatting.

To ensure that the LLM judge never assigns a FAIL label to an output.

To allow the judge to pick a winner between two outputs, which is often more consistent than giving an absolute grade.

When configuring a judge model, why should you set the temperature to `0`?

For more information, so the judge can generate longer, more detailed rationales.

For cost, to make the judge cheaper by using fewer tokens.

For consistency, so the judge provides the same answer for the same input every time.

To maximize the creativity of the judge's critiques.

What does it mean to overfit in your evaluation pipeline?

When the prompt is tuned to pass a specific alignment dataset but fails to generalize to new, unseen data.

When the judge is too slow to run in CI/CD.

When you use both rule-based tests and AI evaluations.

When the judge is configured with a temperature that is too low or other settings that are too high.

What is the bootstrapping technique used for?

To randomly re-sample the alignment dataset to check how sensitive the judge's score is.

To generate a large volume of synthetic user inputs using a smaller model.

To automatically fix errors in the application's code.

To implement a JSON schema for all judge inputs and outputs.
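
As an illustration of re-sampling an alignment dataset, here is a minimal sketch. It assumes the dataset is two parallel lists of labels (human and judge) and that the score of interest is simple agreement; the function name, the 95% percentile interval, and the default of 1,000 resamples are assumptions made for this example.

```python
import random
import statistics


def bootstrap_agreement(human, judge, n_resamples=1000, seed=0):
    """Resample (human, judge) label pairs with replacement and report
    the mean agreement plus an approximate 95% percentile interval."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    pairs = list(zip(human, judge))
    scores = []
    for _ in range(n_resamples):
        sample = rng.choices(pairs, k=len(pairs))  # sampling with replacement
        scores.append(sum(h == j for h, j in sample) / len(sample))

    scores.sort()
    low = scores[int(0.025 * n_resamples)]
    high = scores[int(0.975 * n_resamples) - 1]
    return statistics.mean(scores), low, high
```

A wide gap between `low` and `high` suggests the score depends heavily on which samples happen to be in the dataset.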

What metric is used to measure 'agreement beyond luck' between human experts or between a judge and a human?

Accuracy

Precision

Kappa score

F₁ score
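
For reference, the metrics listed above differ in whether they correct for chance. The sketch below computes Cohen's kappa for two raters, which is one common form of a kappa score; treating the question as the two-rater case is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement if each rater labeled independently at their own rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        counts_a[label] * counts_b[label]
        for label in counts_a.keys() | counts_b.keys()
    ) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # and this sketch reports it as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


human = ["PASS", "PASS", "FAIL", "FAIL"]
judge = ["PASS", "FAIL", "FAIL", "FAIL"]
kappa = cohens_kappa(human, judge)  # raw agreement is 0.75, kappa is 0.5
```

In this toy example the raters agree on 3 of 4 samples, but half of that agreement is expected by chance, so kappa is lower than raw accuracy.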

When evaluating toxicity, why prioritize recall over precision?

Because toxic outputs are the negative class in this specific context.

Because it's more important to identify all toxic outputs, even if some are false positives, than to miss toxic outputs (false negatives).

Because high precision ensures that the judge is never too strict.

Because recall costs fewer API tokens, so you can run more evaluations.
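
To make the trade-off between the two metrics concrete, here is a minimal sketch that computes both from human and judge labels. The label names `TOXIC` and `SAFE` and the function name are assumptions for illustration.

```python
def precision_recall(human, judge, positive="TOXIC"):
    """Return (precision, recall) for the positive class, using the
    human labels as ground truth and the judge labels as predictions."""
    pairs = list(zip(human, judge))
    tp = sum(h == positive and j == positive for h, j in pairs)
    fp = sum(h != positive and j == positive for h, j in pairs)  # false alarms
    fn = sum(h == positive and j != positive for h, j in pairs)  # missed cases

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


human = ["TOXIC", "TOXIC", "SAFE", "SAFE"]
judge = ["TOXIC", "TOXIC", "TOXIC", "SAFE"]
precision, recall = precision_recall(human, judge)
```

Here the judge catches every toxic output (recall of 1.0) at the cost of one false alarm, which lowers precision to 2/3.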

What is the dynamic rubric pattern?

A system where human evaluators manually grade each production output.

A prompt that changes random variables each time it's run.

Using a separate model to rewrite the user's prompt before it reaches the judge.

Passing a string that describes the exact behavior or edge case the judge should look for in a specific sample.
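
As a rough illustration of passing a per-sample description to a judge, the sketch below builds a judge prompt from a shared template plus a sample-specific rubric string. The template wording, the PASS/FAIL output format, and the function name are all hypothetical; no particular judge API is assumed.

```python
JUDGE_TEMPLATE = """You are evaluating one output from an application.

Output to evaluate:
{output}

What to check for in this specific sample:
{rubric}

Answer with PASS or FAIL, followed by a one-sentence rationale."""


def build_judge_prompt(output: str, rubric: str) -> str:
    """Fill the shared template with one sample's output and its rubric."""
    if not rubric.strip():
        raise ValueError("each sample needs a non-empty rubric string")
    return JUDGE_TEMPLATE.format(output=output, rubric=rubric)


prompt = build_judge_prompt(
    output="Brighter smiles, gentler care.",
    rubric="The motto must not promise a specific medical outcome.",
)
```

Keeping the rubric as data next to each sample means the shared template stays unchanged when new edge cases are added.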