# Public Dataset

What people want to use AI for, and which models work

## What's in the dataset
Every time someone uses Bearing, we record the task classification (type, subtype, complexity), the user's priority ranking, which models were recommended and with what scores, which model the user selected, and, optionally, whether it worked.
We also publish head-to-head comparison data: which two models were compared, and which one the user preferred.
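For concreteness, here is one illustrative record in the shape of the recommendation schema documented below. All values are hypothetical, not drawn from the dataset:

```python
import json

# One illustrative recommendation record (hypothetical values),
# using the field names from the recommendation schema.
record = {
    "task_type": "writing",
    "task_subtype": "blog_post",
    "complexity": "medium",
    "input_length": "short",
    "needs_vision": False,
    "needs_tools": False,
    "needs_code": False,
    "is_recurring": True,
    "priority_order": ["quality", "cost", "speed"],
    "models_recommended": [
        {"slug": "model-a", "rank": 1, "weighted_score": 0.82},
        {"slug": "model-b", "rank": 2, "weighted_score": 0.77},
    ],
    # The user picked the second-ranked recommendation.
    "model_selected": {"slug": "model-b", "recommended_rank": 2},
    "outcome_success": True,
    "failure_reason": None,
    "task_date": "2025-01-15",
}

print(json.dumps(record, indent=2))
```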
## Why this matters
There is no existing public dataset of real-world "task → model → did it work?" decisions. Benchmarks test raw capability; Bearing tests fit: whether a model is the right choice for what someone actually wants to do.
This data is useful for anyone building routing systems, recommendation engines, or evaluation tools for AI models.
## Download

### Recommendation data

Task classifications, model recommendations, selections, and outcomes.
## Methodology
- Task classification: Claude Haiku with a confidence threshold of 0.6. Tasks below this threshold go through a clarification step.
- Scoring: 7-factor weighted scoring based on user-ranked priorities. See About for details.
- Selection signal: which model the user chose and at what rank in the recommendation list.
- Outcome signal: optional thumbs up/down with structured failure reasons (e.g. quality, speed, cost, hallucination).
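The rank-weighted scoring step can be sketched as follows. The exact seven factors and weighting scheme are described on the About page, not here, so the linear rank weighting below is an assumption for illustration only:

```python
# Minimal sketch (assumed form) of scoring a model against user-ranked
# priorities: the rank-1 factor gets the largest weight, and weights
# decrease linearly with rank, normalised to sum to 1.
def weighted_score(factor_scores: dict[str, float],
                   priority_order: list[str]) -> float:
    """Combine per-factor scores using the user's priority ranking."""
    n = len(priority_order)
    total = sum(range(1, n + 1))
    weights = {f: (n - i) / total for i, f in enumerate(priority_order)}
    return sum(w * factor_scores.get(f, 0.0) for f, w in weights.items())

# Hypothetical per-factor scores for one model.
scores = {"quality": 0.9, "speed": 0.6, "cost": 0.4}
print(round(weighted_score(scores, ["quality", "cost", "speed"]), 3))  # → 0.683
```

Reordering the priorities changes the result, which is the point: the same model can be the best fit for one user and a poor fit for another.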
## Schema

### Recommendation dataset
| Field | Type | Description |
|---|---|---|
| task_type | string | Primary task category |
| task_subtype | string | Specific task sub-category |
| complexity | string | One of: `low`, `medium`, `high` |
| input_length | string | One of: `short`, `medium`, `long`, `very_long` |
| needs_vision | boolean | Requires image/vision capabilities |
| needs_tools | boolean | Requires tool use / function calling |
| needs_code | boolean | Requires code generation |
| is_recurring | boolean | Recurring or repeated task |
| priority_order | string[] | User-ranked priority factors |
| models_recommended | object[] | {slug, rank, weighted_score} |
| model_selected | object | {slug, recommended_rank} |
| outcome_success | boolean? | User-reported success |
| failure_reason | string? | Failure reason if applicable |
| task_date | date | Date the task was created |
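A minimal sketch of working with rows in this schema, for example computing per-model success rates over records that have a reported outcome. The rows here are hypothetical; in practice you would load them from the downloaded export:

```python
from collections import defaultdict

# Hypothetical rows in the recommendation schema (outcome_success is
# optional, so it may be None when no feedback was given).
rows = [
    {"model_selected": {"slug": "model-a"}, "outcome_success": True},
    {"model_selected": {"slug": "model-a"}, "outcome_success": None},
    {"model_selected": {"slug": "model-b"}, "outcome_success": False},
]

# slug -> [successes, rows with a reported outcome]
tallies = defaultdict(lambda: [0, 0])
for row in rows:
    slug = row["model_selected"]["slug"]
    if row["outcome_success"] is not None:
        tallies[slug][0] += int(row["outcome_success"])
        tallies[slug][1] += 1

rates = {slug: s / n for slug, (s, n) in tallies.items()}
print(rates)  # → {'model-a': 1.0, 'model-b': 0.0}
```

Note that rows with `outcome_success` absent are excluded from the denominator rather than counted as failures.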
### Comparison dataset
| Field | Type | Description |
|---|---|---|
| task_type | string | Primary task category |
| model_a_slug | string | First model in comparison |
| model_b_slug | string | Second model in comparison |
| preferred | string | One of: `model_a`, `model_b`, `tie` |
| preference_reason | string? | Reason for preference |
| task_date | date | Date of comparison |
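Pairwise preference rows like these can be aggregated into simple per-model win rates. This is a sketch with hypothetical rows; counting a tie as half a win for each side is one common convention, not something the dataset prescribes:

```python
from collections import Counter

# Hypothetical comparison rows in the schema above.
comparisons = [
    {"model_a_slug": "x", "model_b_slug": "y", "preferred": "model_a"},
    {"model_a_slug": "x", "model_b_slug": "y", "preferred": "model_b"},
    {"model_a_slug": "y", "model_b_slug": "x", "preferred": "model_a"},
    {"model_a_slug": "x", "model_b_slug": "z", "preferred": "tie"},
]

wins, games = Counter(), Counter()
for c in comparisons:
    a, b = c["model_a_slug"], c["model_b_slug"]
    games[a] += 1
    games[b] += 1
    if c["preferred"] == "model_a":
        wins[a] += 1
    elif c["preferred"] == "model_b":
        wins[b] += 1
    else:  # tie: half a win to each side
        wins[a] += 0.5
        wins[b] += 0.5

win_rate = {m: wins[m] / games[m] for m in games}
print({m: round(r, 3) for m, r in win_rate.items()})  # → {'x': 0.375, 'y': 0.667, 'z': 0.5}
```

For a ranking that accounts for opponent strength rather than raw win rate, pairwise data like this is also a natural input to a Bradley–Terry or Elo-style model.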
## Licence
This dataset is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). You are free to share and adapt the data for non-commercial purposes with attribution.
Built by The Good Ship · good-ship.co.uk