Public Dataset
What people want to use AI for, and which models work
What's in the dataset
Every time someone uses Bearing, we record: the task classification (type, subtype, complexity); the user's priority ranking and the normalised factor weights the recommender actually applied; which models were recommended and at what scores; any multi-stage pipeline plan; which model the user selected (if any); and, optionally, whether it worked. Tasks that reached the recommendation stage are included even when the user did not pick a model.
We also publish head-to-head comparison data: which two models were compared, and which one the user preferred.
Why this matters
There is no existing public dataset of real-world “task → model → did it work?” decisions. Benchmarks test raw capability; Bearing tests fit: whether a model is the right choice for what someone actually wants to do.
This data is useful for anyone building routing systems, recommendation engines, or evaluation tools for AI models.
Download
Recommendation data
Task classifications, model recommendations, selections, and outcomes.
Methodology
- Task classification: Claude Haiku with a confidence threshold of 0.6. Tasks below this threshold go through a clarification step.
- Scoring: 7-factor weighted scoring based on user-ranked priorities. See About for details.
- Selection signal: which model the user chose and at what rank in the recommendation list.
- Outcome signal: optional thumbs up/down with structured failure reasons (e.g. quality, speed, cost, hallucination).
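
The scoring step above can be illustrated with a minimal sketch. It is not Bearing's actual formula: the linear rank-to-weight mapping, the function names, and the example factor names are all assumptions made for illustration. It only shows how a user's priority ranking could become normalised weights (as stored in factor_weights) and then a single weighted score per model.

```python
def normalise_weights(priority_order, excluded_factors=()):
    """Turn a ranked list of factors into weights that sum to 1.

    Assumption: weights fall off linearly with rank. The real
    recommender may use a different mapping.
    """
    excluded = set(excluded_factors)
    active = [f for f in priority_order if f not in excluded]
    n = len(active)
    # Highest-priority factor gets raw weight n, the lowest gets 1.
    raw = {factor: n - i for i, factor in enumerate(active)}
    total = sum(raw.values())
    return {factor: w / total for factor, w in raw.items()}


def weighted_score(factor_scores, weights):
    """Weighted sum of per-factor scores; missing factors count as 0."""
    return sum(w * factor_scores.get(f, 0.0) for f, w in weights.items())


# Hypothetical factor names, for illustration only.
weights = normalise_weights(["quality", "cost", "speed"], excluded_factors=["cost"])
score = weighted_score({"quality": 1.0, "speed": 0.5}, weights)
```

Excluded factors are dropped before normalisation, so the remaining weights still sum to 1.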
Schema
Recommendation dataset
| Field | Type | Description |
|---|---|---|
| task_type | string | Primary task category |
| task_subtype | string | Specific task sub-category |
| complexity | string | One of: low, medium, high |
| input_length | string | One of: short, medium, long, very_long |
| needs_vision | boolean | Requires image/vision capabilities |
| needs_tools | boolean | Requires tool use / function calling |
| needs_code | boolean | Requires code generation |
| needs_reasoning | boolean | Requires multi-step reasoning |
| is_recurring | boolean | Recurring or repeated task |
| mode | string | One of: recommend, pipeline, validate |
| priority_order | string[] | User-ranked priority factors |
| excluded_factors | string[] | Factors the user opted out of |
| factor_weights | object? | Normalised per-factor weights actually applied |
| pipeline_stages | object[]? | Multi-stage plan if recommended |
| classification_schema_version | string | Version of the task_type enum used: v0.7 or v0.8 |
| models_recommended | object[] | {slug, rank, weighted_score} |
| local_recommendations | object[] | {slug, rank, effective_quality, quant, vram_gb, hardware_tier_id} for local-inference candidates |
| model_selected | object? | {slug, recommended_rank}; null if no model was selected |
| outcome_success | boolean? | User-reported success |
| failure_reason | string? | Failure reason if applicable |
| task_date | date | Date the task was created |
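
As a rough illustration of how these fields might be used, the sketch below computes two simple statistics from recommendation records. It assumes the records have already been loaded into a list of dictionaries keyed by the field names above; the download format is not assumed here. It also assumes that recommended_rank is 1-based, which the schema does not state, so the top rank is a parameter. The function names are hypothetical.

```python
from collections import defaultdict


def top_pick_rate(records, top_rank=1):
    """Share of selections where the user chose the top-ranked model.

    Assumption: recommended_rank == top_rank marks the first recommendation.
    Returns None when no record has a selection.
    """
    selected = [r for r in records if r.get("model_selected")]
    if not selected:
        return None
    top = sum(
        1 for r in selected
        if r["model_selected"].get("recommended_rank") == top_rank
    )
    return top / len(selected)


def success_rate_by_model(records):
    """User-reported success rate per selected model slug.

    Records with no selection or no reported outcome are skipped,
    because outcome_success is optional.
    """
    counts = defaultdict(lambda: [0, 0])  # slug -> [successes, reports]
    for r in records:
        selection = r.get("model_selected")
        outcome = r.get("outcome_success")
        if selection is None or outcome is None:
            continue
        counts[selection["slug"]][1] += 1
        if outcome:
            counts[selection["slug"]][0] += 1
    return {slug: wins / total for slug, (wins, total) in counts.items()}
```

Because outcome reporting is optional, success rates computed this way cover only the subset of tasks where users left feedback.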
Comparison dataset
| Field | Type | Description |
|---|---|---|
| task_type | string | Primary task category |
| classification_schema_version | string | Version of the task_type enum used: v0.7 or v0.8 |
| model_a_slug | string | First model in comparison |
| model_b_slug | string | Second model in comparison |
| preferred | string | One of: model_a, model_b, tie |
| preference_reason | string? | Reason for preference |
| task_date | date | Date of comparison |
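
One possible use of the comparison fields is a per-model win rate. The sketch below assumes the comparison rows are already loaded as a list of dictionaries keyed by the field names above. Counting a tie as an appearance with no win is a design choice made here, not something the dataset prescribes, and the function name is hypothetical.

```python
from collections import Counter


def win_rates(comparisons):
    """Fraction of comparisons each model won.

    Ties count towards a model's appearances but not its wins.
    """
    wins = Counter()
    appearances = Counter()
    for row in comparisons:
        a, b = row["model_a_slug"], row["model_b_slug"]
        appearances[a] += 1
        appearances[b] += 1
        if row["preferred"] == "model_a":
            wins[a] += 1
        elif row["preferred"] == "model_b":
            wins[b] += 1
    return {model: wins[model] / appearances[model] for model in appearances}
```

Win rates from small samples are noisy, so it may be worth reporting the number of appearances alongside each rate.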
Licence
This dataset is released under Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0). You are free to share and adapt the data for non-commercial purposes with attribution.
Built by The Good Ship · good-ship.co.uk