Public Dataset

What people want to use AI for, and which models work

What's in the dataset

Every time someone uses Bearing, we record the task classification (type, subtype, complexity), the user's priority ranking, the normalised factor weights the recommender actually applied, the models recommended and their scores, any multi-stage pipeline plan, the model the user selected (if any), and, optionally, whether it worked. Tasks that reached the recommendation stage are included even when the user did not pick a model.

We also publish head-to-head comparison data: which two models were compared, and which one the user preferred.

What we never collect: raw task descriptions, prompts, email addresses, or IP addresses. All data is anonymised before storage.

Why this matters

There is no existing public dataset of real-world “task → model → did it work?” decisions. Benchmarks test raw capability; Bearing tests fit: whether a model is the right choice for what someone actually wants to do.

This data is useful for anyone building routing systems, recommendation engines, or evaluation tools for AI models.

Download

Recommendation data

Task classifications, model recommendations, selections, and outcomes.

Download JSON · Download CSV

Comparison data

Head-to-head model preferences with task context.

Download JSON · Download CSV

Methodology

Task classification: Claude Haiku with a confidence threshold of 0.6. Tasks classified with confidence below this threshold go through a clarification step.
Scoring: 7-factor weighted scoring based on user-ranked priorities. See About for details.
Selection signal: which model the user chose and at what rank in the recommendation list.
Outcome signal: optional thumbs up/down with structured failure reasons (e.g. quality, speed, cost, hallucination).
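The classification threshold and the weighted scoring step can be sketched as follows. This is only an illustration: the factor names, the rank-to-weight rule (linear weights normalised to sum to 1), and the per-factor scores are assumptions, not the recommender's actual formula, which is described in About.

```python
CONFIDENCE_THRESHOLD = 0.6  # value stated in the methodology


def needs_clarification(confidence: float) -> bool:
    """True when the classifier's confidence is below the threshold."""
    return confidence < CONFIDENCE_THRESHOLD


def normalised_weights(priority_order, excluded_factors=()):
    """Hypothetical rule: the top-ranked factor gets the largest raw weight,
    then all weights are scaled so that they sum to 1."""
    active = [f for f in priority_order if f not in excluded_factors]
    n = len(active)
    raw = {factor: n - i for i, factor in enumerate(active)}
    total = sum(raw.values())
    return {factor: w / total for factor, w in raw.items()}


def weighted_score(factor_scores, weights):
    """Weighted sum of per-factor scores; missing factors count as 0."""
    return sum(w * factor_scores.get(f, 0.0) for f, w in weights.items())


# Illustrative inputs only: factor names and scores are made up.
weights = normalised_weights(["quality", "speed", "cost"])
score = weighted_score({"quality": 0.8, "speed": 0.6, "cost": 0.4}, weights)
```

Because the published factor_weights field holds the weights that were actually applied, anyone reproducing scores should prefer that field over a reconstructed rule like the one above.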

Schema

Recommendation dataset

Field	Type	Description
task_type	string	Primary task category
task_subtype	string	Specific task sub-category
complexity	string	low | medium | high
input_length	string	short | medium | long | very_long
needs_vision	boolean	Requires image/vision capabilities
needs_tools	boolean	Requires tool use / function calling
needs_code	boolean	Requires code generation
needs_reasoning	boolean	Requires multi-step reasoning
is_recurring	boolean	Recurring or repeated task
mode	string	recommend | pipeline | validate
priority_order	string[]	User-ranked priority factors
excluded_factors	string[]	Factors the user opted out of
factor_weights	object?	Normalised per-factor weights actually applied
pipeline_stages	object[]?	Multi-stage plan if recommended
classification_schema_version	string	v0.7 or v0.8; version of the task_type enum used
models_recommended	object[]	{slug, rank, weighted_score}
local_recommendations	object[]	{slug, rank, effective_quality, quant, vram_gb, hardware_tier_id} for local-inference candidates
model_selected	object?	{slug, recommended_rank} — null if no selection
outcome_success	boolean?	User-reported success
failure_reason	string?	Failure reason if applicable
task_date	date	Date the task was created
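As a rough sketch of how these fields might be used, the snippet below summarises records per task_type: how often a model was selected, how often the top recommendation was the one picked, and the success rate among reported outcomes. It assumes the JSON export is a list of objects with the field names above and that recommended_rank is 1-based; the sample records, task types, and slugs are made up.

```python
import json
from collections import defaultdict

# Hypothetical records in the shape of the schema above; values are invented.
SAMPLE_JSON = """
[
  {"task_type": "coding", "model_selected": {"slug": "model-x", "recommended_rank": 1}, "outcome_success": true},
  {"task_type": "coding", "model_selected": {"slug": "model-y", "recommended_rank": 2}, "outcome_success": false},
  {"task_type": "coding", "model_selected": null, "outcome_success": null},
  {"task_type": "writing", "model_selected": {"slug": "model-x", "recommended_rank": 1}, "outcome_success": null}
]
"""


def summarise(records):
    """Count selections and reported outcomes for each task_type."""
    stats = defaultdict(lambda: {
        "tasks": 0, "selected": 0, "picked_top": 0, "reported": 0, "succeeded": 0,
    })
    for record in records:
        s = stats[record["task_type"]]
        s["tasks"] += 1
        selection = record.get("model_selected")
        if selection is not None:  # null means the user picked no model
            s["selected"] += 1
            if selection["recommended_rank"] == 1:  # assumes 1-based ranks
                s["picked_top"] += 1
        outcome = record.get("outcome_success")
        if outcome is not None:  # outcome feedback is optional
            s["reported"] += 1
            s["succeeded"] += int(outcome)
    return dict(stats)


summary = summarise(json.loads(SAMPLE_JSON))
```

Keeping "selected" and "reported" as separate denominators matters here, since both model_selected and outcome_success are nullable and most rates would otherwise be understated.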

Comparison dataset

Field	Type	Description
task_type	string	Primary task category
classification_schema_version	string	v0.7 or v0.8; version of the task_type enum used
model_a_slug	string	First model in comparison
model_b_slug	string	Second model in comparison
preferred	string	model_a | model_b | tie
preference_reason	string?	Reason for preference
task_date	date	Date of comparison
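One simple way to read the comparison data is a per-model win rate. The sketch below counts a tie as half a win for each side, which is a common convention but an assumption here, not something the dataset prescribes; the sample rows and slugs are made up.

```python
from collections import Counter


def win_rates(rows):
    """Share of comparisons each model won, with ties counted as 0.5."""
    wins = Counter()
    appearances = Counter()
    for row in rows:
        a, b = row["model_a_slug"], row["model_b_slug"]
        appearances[a] += 1
        appearances[b] += 1
        if row["preferred"] == "model_a":
            wins[a] += 1
        elif row["preferred"] == "model_b":
            wins[b] += 1
        else:  # "tie": split the point between both models
            wins[a] += 0.5
            wins[b] += 0.5
    return {model: wins[model] / appearances[model] for model in appearances}


# Hypothetical rows in the shape of the comparison schema.
sample_rows = [
    {"model_a_slug": "model-x", "model_b_slug": "model-y", "preferred": "model_a"},
    {"model_a_slug": "model-x", "model_b_slug": "model-y", "preferred": "tie"},
    {"model_a_slug": "model-y", "model_b_slug": "model-z", "preferred": "model_b"},
]
rates = win_rates(sample_rows)
```

Raw win rates ignore which opponents a model faced, so for anything beyond a quick look it may be worth grouping by task_type or fitting a pairwise ranking model instead.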

Licence

This dataset is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). You are free to share and adapt the data for non-commercial purposes with attribution.

Built by The Good Ship · good-ship.co.uk