Public Dataset

What people want to use AI for, and which models work

What's in the dataset

Every time someone uses Bearing, we record the task classification (type, subtype, complexity), the user's priority ranking, which models were recommended and at what scores, which model the user selected, and -- optionally -- whether it worked.

We also publish head-to-head comparison data: which two models were compared, and which one the user preferred.

What we never collect: raw task descriptions, prompts, email addresses, or IP addresses. All data is anonymised before storage.

Why this matters

There is no existing public dataset of real-world “task → model → did it work?” decisions. Benchmarks test raw capability; Bearing tests fit -- whether a model is the right choice for what someone actually wants to do.

This data is useful for anyone building routing systems, recommendation engines, or evaluation tools for AI models.

Download

Recommendation data

Task classifications, model recommendations, selections, and outcomes.

Comparison data

Head-to-head model preferences with task context.

Methodology

  • Task classification: Claude Haiku with a confidence threshold of 0.6. Tasks below this threshold go through a clarification step.
  • Scoring: 7-factor weighted scoring based on user-ranked priorities. See About for details.
  • Selection signal: which model the user chose and at what rank in the recommendation list.
  • Outcome signal: optional thumbs up/down with structured failure reasons (e.g. quality, speed, cost, hallucination).
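The exact 7-factor formula is not published on this page, so the following is only a minimal sketch of rank-weighted scoring: each priority factor gets a weight that decays with its position in the user's ranking, and per-factor scores (assumed to be normalised to 0–1) are combined into one weighted score. The reciprocal-rank weights and the factor names are assumptions for illustration, not Bearing's actual parameters.

```python
def weighted_score(priority_order, factor_scores):
    """Combine per-factor scores (0-1) into a single score.

    priority_order: user-ranked factor names, highest priority first
                    (hypothetical names, e.g. "quality", "cost", "speed").
    factor_scores:  per-model scores keyed by factor name.

    Weights decay with rank (1, 1/2, 1/3, ...) -- an assumed scheme,
    not the published Bearing weighting -- and are normalised so the
    result stays in the 0-1 range of the inputs.
    """
    weights = {f: 1.0 / (i + 1) for i, f in enumerate(priority_order)}
    total = sum(weights.values())
    return sum(weights[f] * factor_scores.get(f, 0.0)
               for f in priority_order) / total
```

With this scheme a model that is strong on the user's top-ranked factor beats one that is strong only on lower-ranked factors, which is the behaviour a priority-driven recommender needs.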

Schema

Recommendation dataset

Field                 Type        Description
task_type             string      Primary task category
task_subtype          string      Specific task sub-category
complexity            string      low | medium | high
input_length          string      short | medium | long | very_long
needs_vision          boolean     Requires image/vision capabilities
needs_tools           boolean     Requires tool use / function calling
needs_code            boolean     Requires code generation
is_recurring          boolean     Recurring or repeated task
priority_order        string[]    User-ranked priority factors
models_recommended    object[]    {slug, rank, weighted_score}
model_selected        object      {slug, recommended_rank}
outcome_success       boolean?    User-reported success
failure_reason        string?     Failure reason if applicable
task_date             date        Date the task was created
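To make the schema concrete, here is a hypothetical record shaped like a row of the recommendation dataset. Field names follow the schema above; all values (model slugs, scores, dates) are illustrative only. The `?` types are nullable, shown here as Python `None`.

```python
# Illustrative record only -- the slugs, scores, and date are made up.
record = {
    "task_type": "writing",
    "task_subtype": "summarisation",
    "complexity": "medium",            # low | medium | high
    "input_length": "long",            # short | medium | long | very_long
    "needs_vision": False,
    "needs_tools": False,
    "needs_code": False,
    "is_recurring": True,
    "priority_order": ["quality", "cost", "speed"],
    "models_recommended": [
        {"slug": "model-a", "rank": 1, "weighted_score": 0.82},
        {"slug": "model-b", "rank": 2, "weighted_score": 0.74},
    ],
    "model_selected": {"slug": "model-a", "recommended_rank": 1},
    "outcome_success": True,           # boolean? -> None if unreported
    "failure_reason": None,            # string?  -> set only on failure
    "task_date": "2025-01-15",
}
```

Note that `outcome_success` is optional: rows without user feedback carry a null there, so outcome-based analyses should filter on its presence first.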

Comparison dataset

Field                Type       Description
task_type            string     Primary task category
model_a_slug         string     First model in comparison
model_b_slug         string     Second model in comparison
preferred            string     model_a | model_b | tie
preference_reason    string?    Reason for preference
task_date            date       Date of comparison
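A common first analysis of head-to-head data like this is a per-model win rate. The sketch below aggregates rows shaped like the comparison schema, counting a tie as half a win for each side; this simple tally is one reasonable choice, not a method prescribed by the dataset.

```python
from collections import defaultdict

def win_rates(rows):
    """Per-model win rate from comparison rows.

    Each row follows the comparison schema: model_a_slug, model_b_slug,
    and preferred ("model_a" | "model_b" | "tie"). A tie counts as
    half a win for both models.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for r in rows:
        a, b = r["model_a_slug"], r["model_b_slug"]
        games[a] += 1
        games[b] += 1
        if r["preferred"] == "model_a":
            wins[a] += 1.0
        elif r["preferred"] == "model_b":
            wins[b] += 1.0
        else:  # tie: split the point
            wins[a] += 0.5
            wins[b] += 0.5
    return {m: wins[m] / games[m] for m in games}
```

Raw win rates ignore opponent strength; with enough rows, a pairwise model such as Bradley-Terry gives a fairer ranking, but the tally above is a useful sanity check.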

Licence

This dataset is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). You are free to share and adapt the data for non-commercial purposes with attribution.

Built by The Good Ship · good-ship.co.uk