A regressor that returns a single number is a guess with confidence intervals erased. For most tabular ML before PredictLM, that erasure happens because the standard objective — mean squared error — collapses the predictive distribution to its mean. The shape gets thrown away.
PredictLM's regression head doesn't do that. It returns a 1024-bin discrete distribution over the target — a BarDistribution. This post explains how the head works, why a softmax over bins beats the more common Gaussian mean-variance head for in-context regression, and what the full distribution lets a downstream agent actually do.
The setup
For a regression problem with target y ∈ ℝ, we want a model that outputs not a point ŷ but a distribution p(y | x, context). Three common approaches:
- Point estimate: the model outputs ŷ. Trained with MSE. Throws away everything except the mean.
- Gaussian head: the model outputs (μ, σ). Trained with Gaussian NLL. Assumes the predictive distribution is unimodal and symmetric. Often wrong.
- Quantile / bin head: the model outputs a soft distribution over a discretized target. Trained with cross-entropy against the binned ground truth. Makes no parametric assumption.
PredictLM uses the third option, the bin head.
How the head works
At training time:
- Look at the marginal distribution of y over a large reference dataset of synthetic + real tasks.
- Choose the boundaries of the 1024 bins at equi-frequency quantiles of that marginal. This gives bins that are narrow where the data is dense and wide in the tails, so resolution adapts automatically.
- For each training target y_true, find the bin it falls into and treat that bin index as a classification label.
- The regression head is a single linear layer projecting d_model → 1024. Train it with cross-entropy.
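The steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not PredictLM's training code: the helper names (`make_bin_edges`, `bin_index`, `cross_entropy`) are hypothetical, the reference targets are synthetic Gaussian draws, and 16 bins stand in for 1024 so the result is easy to inspect.

```python
import bisect
import math
import random
import statistics


def make_bin_edges(reference_targets, n_bins):
    """Interior bin boundaries at equi-frequency quantiles (n_bins - 1 cut points)."""
    return statistics.quantiles(reference_targets, n=n_bins, method="inclusive")


def bin_index(y, edges):
    """Index of the bin containing y; values beyond the cut points land in the outer bins."""
    return bisect.bisect_right(edges, y)


def cross_entropy(logits, label):
    """Negative log-probability of the true bin under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]


random.seed(0)
# Assumption: a synthetic stand-in for the reference marginal of y.
reference = [random.gauss(0.0, 1.0) for _ in range(10_000)]

edges = make_bin_edges(reference, n_bins=16)      # 1024 in the post
label = bin_index(0.3, edges)                     # classification label for y_true = 0.3
uniform_loss = cross_entropy([0.0] * 16, label)   # loss of a head that knows nothing
```

Because the cut points are quantiles, every bin holds roughly the same number of reference targets, and an uninformed head starts at a loss of log(n_bins).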
At inference time:
# logits: shape (n_queries, 1024)
probs = softmax(logits, dim=-1)
# predictive mean, for backward compatibility with point-estimate APIs
mean = (probs * bin_centers).sum(dim=-1)
# any quantile you want (quantile is a helper that reads it off the binned CDF)
median = quantile(probs, bin_edges, q=0.5)
ci_90 = (quantile(probs, bin_edges, q=0.05),
         quantile(probs, bin_edges, q=0.95))

The 1024 bins are enough to recover smooth-looking distributions for practical visualization, and the softmax structure means you can sample, take expectations under arbitrary utility functions, and detect multi-modal predictions.
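The snippet above calls a `quantile` helper without defining it. One plausible way to write it for a single query is sketched below, under two assumptions that the post does not state: `bin_edges` holds one more entry than `probs` (finite outer edges), and probability mass is spread uniformly inside each bin.

```python
import bisect
from itertools import accumulate


def quantile(probs, bin_edges, q):
    """Invert the piecewise-linear CDF implied by per-bin probabilities.

    Assumes len(bin_edges) == len(probs) + 1 and uniform density inside a bin.
    """
    cdf = list(accumulate(probs))  # CDF evaluated at each bin's right edge
    # First bin whose cumulative mass reaches q (clamped against rounding error).
    k = min(bisect.bisect_left(cdf, q), len(probs) - 1)
    mass_before = cdf[k - 1] if k > 0 else 0.0
    frac = (q - mass_before) / probs[k] if probs[k] > 0 else 0.0
    left, right = bin_edges[k], bin_edges[k + 1]
    return left + frac * (right - left)  # linear interpolation inside the bin


# Toy distribution with 5 bins instead of 1024.
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
bin_edges = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

median = quantile(probs, bin_edges, q=0.5)
ci_90 = (quantile(probs, bin_edges, q=0.05),
         quantile(probs, bin_edges, q=0.95))
```

A batched implementation would do the same thing with a cumulative sum and a sorted search along the last axis.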
Why softmax-over-bins beats Gaussian heads for in-context regression
Three reasons.
1. The predictive distribution is often multi-modal
In-context learning means the model is inferring the dataset's structure on the fly from a small context. When the context is ambiguous — e.g., 30 rows that could equally support two different relationships between features and target — the right predictive distribution is bimodal. A Gaussian head can't represent this. It picks a mean somewhere in between, with a wide variance that underrepresents the confidence in each mode.
A binned head represents bimodality natively. The downstream agent can detect "there are two peaks at 0.3 and 0.7" and route to a clarification question rather than committing to 0.5.
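To make the "two peaks" check concrete, here is a rough sketch of how an agent might pull modes out of the binned output. `find_modes` and its `rel_height` threshold are hypothetical, not part of PredictLM. The sketch converts bin mass to density first, because equi-frequency bins have unequal widths, and the toy distribution uses 11 equal-width bins for readability.

```python
def find_modes(probs, bin_edges, rel_height=0.5):
    """Centers of local density maxima at least rel_height times the global peak."""
    widths = [right - left for left, right in zip(bin_edges, bin_edges[1:])]
    density = [p / w for p, w in zip(probs, widths)]  # mass per unit of y
    floor = rel_height * max(density)
    padded = [0.0] + density + [0.0]  # treat both ends as zero density
    modes = []
    for i in range(1, len(padded) - 1):
        rises = padded[i] > padded[i - 1]
        holds = padded[i] >= padded[i + 1]  # a plateau counts once, at its left end
        if rises and holds and padded[i] >= floor:
            modes.append(0.5 * (bin_edges[i - 1] + bin_edges[i]))
    return modes


# Toy bimodal prediction: 11 bins of width 0.1 centered on 0.0, 0.1, ..., 1.0.
bin_edges = [-0.05 + 0.1 * i for i in range(12)]
bin_centers = [0.5 * (a + b) for a, b in zip(bin_edges, bin_edges[1:])]
probs = [0.0, 0.02, 0.10, 0.30, 0.06, 0.02, 0.08, 0.30, 0.10, 0.02, 0.0]

modes = find_modes(probs, bin_edges)  # two peaks, near 0.3 and 0.7
mean = sum(p * c for p, c in zip(probs, bin_centers))  # lands in the trough between them
```

With 1024 bins the raw density is noisy, so a real detector would probably smooth or coarsen the histogram before looking for peaks.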
2. Calibration is structural, not a separate fix
Gaussian heads are notoriously overconfident: the variance term gets optimized to fit the training distribution and routinely underestimates tail risk. The standard fix is post-hoc recalibration, such as isotonic regression on the predicted CDF or a learned variance scale (the regression counterparts of Platt scaling), which is a band-aid.
A binned head trained with cross-entropy is pushed toward calibration by its objective: the softmax outputs are direct probability estimates, and cross-entropy is a proper scoring rule, so its expected value is minimized only by the true bin probabilities. That rewards calibration rather than guaranteeing it, but our reliability diagrams show ECE < 0.04 on the OpenML benchmark with no post-hoc fix.
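One simple way to check a claim like this on your own data is the probability integral transform (PIT): evaluate the predicted CDF at each true target, and for a calibrated model those values are uniform on [0, 1]. The sketch below is a related diagnostic, not the ECE computation behind the number above. `cdf_at` and `calibration_gap` are hypothetical helpers, and the data is synthetic.

```python
import bisect
import random


def cdf_at(probs, bin_edges, y):
    """Predicted P(Y <= y), assuming uniform density inside each bin."""
    if y <= bin_edges[0]:
        return 0.0
    if y >= bin_edges[-1]:
        return 1.0
    k = bisect.bisect_right(bin_edges, y) - 1  # bin containing y
    frac = (y - bin_edges[k]) / (bin_edges[k + 1] - bin_edges[k])
    return sum(probs[:k]) + frac * probs[k]


def calibration_gap(pits, levels=tuple(i / 10 for i in range(1, 10))):
    """Mean |observed - nominal| frequency of PIT values at or below each level."""
    gaps = []
    for level in levels:
        observed = sum(pit <= level for pit in pits) / len(pits)
        gaps.append(abs(observed - level))
    return sum(gaps) / len(gaps)


random.seed(0)
bin_edges = [0.0, 1.0, 2.0, 3.0, 4.0]
targets = [random.uniform(0.0, 4.0) for _ in range(5000)]  # truth: uniform on [0, 4]

honest = [0.25, 0.25, 0.25, 0.25]        # matches the true distribution
overconfident = [0.0, 0.5, 0.5, 0.0]     # ignores both tails

gap_honest = calibration_gap([cdf_at(honest, bin_edges, y) for y in targets])
gap_overconfident = calibration_gap([cdf_at(overconfident, bin_edges, y) for y in targets])
```

The honest prediction yields a gap near zero, while the overconfident one piles PIT values at 0 and 1 and shows a clearly larger gap.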
3. The head is task-agnostic in the loss
A nice side-effect: the regression head and the classification head are now the same operation — softmax over a discrete output space. The only difference is the output dimensionality and the bin-edge metadata. The optimizer sees one consistent objective, the codebase has one shared training loop, and the same calibration tooling works for both.
Three things you can do with a predictive distribution
Once you have p(y | x, context) instead of ŷ, three downstream patterns become available that simply aren't possible with a point estimate:
- Risk-aware decisions. Compute E_p[u(y)] under an arbitrary utility function u. For a loan-default prediction, you don't want the mean default probability; you want the expected loss given your bank's actual loss function, which is asymmetric.
- Active sampling. Pick the queries with the highest predictive entropy. Label them. Add them to the context. This is essentially active learning by uncertainty sampling, with no special infrastructure.
- Hallucination detection inside agents. When an LLM calls PredictLM as a tool, the agent can read the predictive entropy. High entropy on a query the LLM is confident about is a signal that the model and the agent disagree, and the agent should ask for human review.
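The first two patterns reduce to short sums over the bins. The sketch below is illustrative only: `expected_value`, `entropy`, and `shortfall_loss` are hypothetical names, the 3:1 cost ratio is invented, and each bin's mass is treated as sitting at its center.

```python
import math


def expected_value(probs, bin_centers, fn):
    """E_p[fn(y)], approximating each bin by a point mass at its center."""
    return sum(p * fn(c) for p, c in zip(probs, bin_centers))


def entropy(probs):
    """Shannon entropy (in nats) of the bin probabilities."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def shortfall_loss(action, y, under_cost=3.0, over_cost=1.0):
    """Hypothetical asymmetric loss: undershooting y costs 3x overshooting it."""
    return under_cost * (y - action) if y > action else over_cost * (action - y)


bin_centers = [0.5, 1.5, 2.5, 3.5, 4.5]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]

# Risk-aware decision: the loss-minimizing action is not the predictive mean.
mean = expected_value(probs, bin_centers, lambda y: y)
best_action = min(
    bin_centers,
    key=lambda a: expected_value(probs, bin_centers, lambda y: shortfall_loss(a, y)),
)

# Active sampling: pick the query whose predictive distribution is most spread out.
batch = [[0.2] * 5, [0.0, 0.0, 1.0, 0.0, 0.0], probs]
most_uncertain = max(range(len(batch)), key=lambda i: entropy(batch[i]))
```

Here the mean is 2.5 but the asymmetric loss prefers 3.5, and the uniform prediction is the one selected for labeling. The same entropy value is what an agent would threshold on for the review signal in the third pattern.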
The cost
A 1024-dimensional softmax has overhead in both training memory (the n_queries × 1024 logit matrix that the cross-entropy loss is computed over) and inference latency (sampling and quantile extraction). In our setup the latency cost is ~6% over a Gaussian head. Training memory is ~12% higher.
If you only ever consume the predictive mean, you're paying for capability you don't use. But the entire reason calibrated uncertainty is interesting is the agent-native use case, where someone will read the full distribution. The cost is worth it.
What we use it for, internally
Every PredictLM evaluation reports both point-estimate metrics (R², accuracy) and distributional metrics (CRPS for regression, ECE for classification). When we ship a new architecture, we look at both. A model that improves R² while worsening CRPS is overfitting the mean and we'd rather not ship it.
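To see why the two kinds of metric can disagree, here is a small CRPS sketch. It is not our evaluation code: `crps` is a hypothetical helper that approximates the binned distribution by point masses at the bin centers and uses the identity CRPS = E|Y − y| − ½·E|Y − Y′|, which is exact for that discrete approximation.

```python
def crps(probs, bin_centers, y_true):
    """CRPS of a discrete distribution with mass probs[i] at bin_centers[i].

    Uses CRPS = E|Y - y_true| - 0.5 * E|Y - Y'|; the double sum is O(K^2).
    """
    spread_to_truth = sum(p * abs(c - y_true) for p, c in zip(probs, bin_centers))
    internal_spread = sum(
        p_i * p_j * abs(c_i - c_j)
        for p_i, c_i in zip(probs, bin_centers)
        for p_j, c_j in zip(probs, bin_centers)
    )
    return spread_to_truth - 0.5 * internal_spread


def predictive_mean(probs, bin_centers):
    return sum(p * c for p, c in zip(probs, bin_centers))


bin_centers = [0.0, 1.0, 2.0]
y_true = 1.0
sharp = [0.0, 1.0, 0.0]        # all mass on the correct bin
diffuse = [0.25, 0.5, 0.25]    # same mean, more spread

# Identical point estimates, so identical squared error ...
same_mean = predictive_mean(sharp, bin_centers) == predictive_mean(diffuse, bin_centers)
# ... but CRPS tells the two predictions apart.
crps_sharp = crps(sharp, bin_centers, y_true)
crps_diffuse = crps(diffuse, bin_centers, y_true)
```

Both predictions have a mean of 1.0, so R² cannot distinguish them, while CRPS scores the sharp one at 0 and the diffuse one at 0.125. For a one-hot prediction CRPS reduces to absolute error.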
Further reading
- The original BarDistribution paper (Müller et al., 2024) for the formal treatment of why cross-entropy over a binned target is a proper scoring rule.
- The TabPFN v2 model card for an earlier real-world deployment of binned regression heads in a tabular foundation model.
- Our own PredictLM-Mini release post for how this head ships in practice.