BarDistribution: why every regressor should return a distribution

A regressor that returns a single number is a guess with confidence intervals erased. For most tabular ML before PredictLM, that erasure happens because the standard objective — mean squared error — collapses the predictive distribution to its mean. The shape gets thrown away.

PredictLM's regression head doesn't do that. It returns a 1024-bin discrete distribution over the target — a BarDistribution. This post explains how the head works, why a softmax over bins beats the more common Gaussian mean-variance head for in-context regression, and what the full distribution lets a downstream agent actually do.

The setup

For a regression problem with target y ∈ ℝ, we want a model that outputs not a point ŷ but a distribution p(y | x, context). Three common approaches:

1. Point estimate: model outputs ŷ. Trained with MSE. Throws away everything except the mean.
2. Gaussian head: model outputs (μ, σ). Trained with Gaussian NLL. Assumes the predictive distribution is unimodal and symmetric. Often wrong.
3. Quantile / bin head: model outputs a soft distribution over a discretized target. Trained with cross-entropy against the binned ground truth. Makes no parametric assumption.

PredictLM uses option 3.

How the head works

At training time:

Look at the marginal distribution of y over a large reference dataset of synthetic + real tasks.
Choose 1024 bin boundaries by equi-frequency quantiles of that marginal. This gives bins that are dense where the data is dense and sparse in the tails — automatic adaptive resolution.
For each training target y_true, find the bin it falls into and treat that bin index as a classification label.
The regression head is a single linear layer projecting d_model → 1024. Train it with cross-entropy.
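The binning steps above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not PredictLM's actual preprocessing: `make_bin_edges` and `to_bin_index` are hypothetical helper names, the reference marginal is replaced by a synthetic Gaussian sample, and clamping out-of-range targets to the edge bins is our own choice.

```python
import numpy as np

N_BINS = 1024  # bin count stated in the post

def make_bin_edges(y_reference, n_bins=N_BINS):
    """Equi-frequency edges: the n_bins + 1 evenly spaced quantiles of y_reference."""
    return np.quantile(y_reference, np.linspace(0.0, 1.0, n_bins + 1))

def to_bin_index(y, edges):
    """Index (0 .. n_bins - 1) of the bin each target falls into."""
    idx = np.searchsorted(edges, y, side="right") - 1
    # Assumption: targets outside the reference range go to the first or last bin.
    return np.clip(idx, 0, len(edges) - 2)

rng = np.random.default_rng(0)
y_reference = rng.standard_normal(200_000)  # synthetic stand-in for the reference marginal
edges = make_bin_edges(y_reference)
labels = to_bin_index(y_reference, edges)   # class labels for the cross-entropy loss
counts = np.bincount(labels, minlength=N_BINS)
widths = np.diff(edges)

print(counts.min(), counts.max())        # near-equal occupancy per bin
print(widths[0] / widths[N_BINS // 2])   # tail bins are much wider than central ones
```

Every bin holds roughly the same number of reference targets, so bins are narrow where targets are common and wide in the tails.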

At inference time:

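import torch

# logits: shape (n_queries, 1024), the output of the regression head
probs = torch.softmax(logits, dim=-1)

# predictive mean, for backward compatibility with point-estimate APIs
# bin_centers: shape (1024,), the midpoint of each bin
mean = (probs * bin_centers).sum(dim=-1)

# any quantile you want; quantile() is a helper that inverts the binned CDF
# bin_edges: shape (1025,)
median = quantile(probs, bin_edges, q=0.5)
ci_90 = (quantile(probs, bin_edges, q=0.05),
         quantile(probs, bin_edges, q=0.95))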

The 1024 bins are enough to recover smooth-looking distributions for practical visualization, and the softmax structure means you can sample, take expectations under arbitrary utility functions, and detect multi-modal predictions.
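The inference snippet above calls a `quantile` helper without showing it. One plausible way to write it, together with a sampler, is sketched below in NumPy. Both functions are assumptions for illustration: they treat the probability mass as uniform within each bin, which the post does not specify, and the four-bin inputs are toy values.

```python
import numpy as np

def quantile(probs, bin_edges, q):
    """Invert the binned CDF at level q, interpolating linearly inside the bin.

    probs: (n_queries, n_bins), rows summing to 1. bin_edges: (n_bins + 1,).
    """
    cdf = np.cumsum(probs, axis=-1)
    last = probs.shape[-1] - 1
    # Index of the first bin whose cumulative mass reaches q.
    k = np.minimum((cdf < q).sum(axis=-1), last)
    rows = np.arange(probs.shape[0])
    mass_before = cdf[rows, k] - probs[rows, k]
    # Fraction of bin k needed to reach q (guarding against empty bins).
    frac = np.clip((q - mass_before) / np.maximum(probs[rows, k], 1e-12), 0.0, 1.0)
    return bin_edges[k] + frac * (bin_edges[k + 1] - bin_edges[k])

def sample(probs, bin_edges, n_samples, rng):
    """Draw samples per query: pick a bin by its mass, then a uniform point in it."""
    cdf = np.cumsum(probs, axis=-1)
    u = rng.random((probs.shape[0], n_samples))
    last = probs.shape[-1] - 1
    k = np.minimum((cdf[:, None, :] < u[:, :, None]).sum(axis=-1), last)
    return bin_edges[k] + rng.random(k.shape) * (bin_edges[k + 1] - bin_edges[k])

bin_edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # toy 4-bin example
probs = np.array([[0.25, 0.25, 0.25, 0.25],
                  [0.70, 0.10, 0.10, 0.10]])
median = quantile(probs, bin_edges, q=0.5)
ci_90 = (quantile(probs, bin_edges, 0.05), quantile(probs, bin_edges, 0.95))
draws = sample(probs, bin_edges, n_samples=1000, rng=np.random.default_rng(0))
```

The sampler materializes an (n_queries, n_samples, n_bins) boolean array, which is fine for a demo but worth replacing with a per-row search for large batches.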

Why softmax-over-bins beats Gaussian heads for in-context regression

Three reasons.

1. The predictive distribution is often multi-modal

In-context learning means the model is inferring the dataset's structure on the fly from a small context. When the context is ambiguous (e.g., 30 rows that could equally support two different relationships between features and target), the right predictive distribution is bimodal. A Gaussian head can't represent this. It puts its mean somewhere in between the modes, where the true density is low, and inflates the variance to cover both, which understates how confident the model is in each mode.

A binned head represents bimodality natively. The downstream agent can detect "there are two peaks at 0.3 and 0.7" and route to a clarification question rather than committing to 0.5.
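As a rough illustration of how an agent might detect the two peaks, the sketch below looks for local maxima of the smoothed density. `find_modes`, its smoothing window, and its height threshold are hypothetical choices for this example, and the demo uses uniform bin edges on [0, 1] rather than equi-frequency ones. Peaks are searched on density (mass divided by bin width), since with unequal bins the raw bin probabilities are not comparable.

```python
import numpy as np

def find_modes(probs, bin_edges, smooth=15, min_rel_height=0.25):
    """Centers of local maxima of the smoothed density (heuristic sketch)."""
    widths = np.diff(bin_edges)
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    density = probs / widths                      # mass per unit of y
    dens = np.convolve(density, np.ones(smooth) / smooth, mode="same")
    inner = dens[1:-1]
    is_peak = (inner > dens[:-2]) & (inner >= dens[2:])
    is_tall = inner >= min_rel_height * dens.max()  # ignore tiny bumps
    return centers[1:-1][is_peak & is_tall]

bin_edges = np.linspace(0.0, 1.0, 1025)           # uniform edges, demo only
centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
bumps = (np.exp(-0.5 * ((centers - 0.3) / 0.05) ** 2)
         + np.exp(-0.5 * ((centers - 0.7) / 0.05) ** 2))
probs = bumps / bumps.sum()                       # synthetic bimodal prediction

modes = find_modes(probs, bin_edges)              # two peaks, near 0.3 and 0.7
mean = (probs * centers).sum()                    # what a point estimate reports
std = np.sqrt((probs * (centers - mean) ** 2).sum())
print(modes, mean, std)
```

A moment-matched Gaussian for this prediction would be centered at the mean, in the valley between the two peaks, which is exactly the value the agent should not commit to.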

2. Calibration is structural, not a separate fix

Gaussian heads are often overconfident: the variance term is fit to the training distribution and routinely underestimates tail risk. The standard fix is post-hoc recalibration, such as isotonic regression on the predicted CDF (Platt scaling is the classification counterpart), which is a band-aid.

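A binned head trained with cross-entropy has calibration built into the objective: the softmax outputs are direct probability estimates, cross-entropy is a proper scoring rule for those probabilities, and there is no parametric shape to misspecify. Gaussian NLL is also a proper scoring rule, but it can only reward the best Gaussian, so a skewed or multi-modal target forces miscalibration somewhere. None of this guarantees calibration on every dataset, but our reliability diagrams show ECE < 0.04 on the OpenML benchmark with no post-hoc fix.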
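One simple way to check calibration of a binned regression head is to evaluate the predicted CDF at each true target (the probability integral transform) and compare empirical against nominal coverage. The sketch below is an assumption-laden illustration, not necessarily the ECE computation behind the number above: `pit` and `quantile_calibration_error` are hypothetical helpers, the data is synthetic, and the "overconfident" predictor is simply one with half the true standard deviation.

```python
import numpy as np

def pit(probs, bin_edges, y):
    """Predicted CDF evaluated at the true target, linear within each bin."""
    n_bins = probs.shape[-1]
    k = np.clip(np.searchsorted(bin_edges, y, side="right") - 1, 0, n_bins - 1)
    rows = np.arange(len(y))
    cdf = np.cumsum(probs, axis=-1)
    below = cdf[rows, k] - probs[rows, k]
    frac = np.clip((y - bin_edges[k]) / (bin_edges[k + 1] - bin_edges[k]), 0.0, 1.0)
    return below + frac * probs[rows, k]

def quantile_calibration_error(pit_values, levels=np.linspace(0.05, 0.95, 19)):
    """Mean |empirical - nominal| coverage over a grid of quantile levels."""
    empirical = np.array([(pit_values <= q).mean() for q in levels])
    return np.abs(empirical - levels).mean()

rng = np.random.default_rng(0)
n = 5000
y_true = rng.standard_normal(n)                  # targets drawn from N(0, 1)
bin_edges = np.linspace(-5.0, 5.0, 201)          # uniform edges, demo only
centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])

def binned_normal(sigma):
    """Same binned N(0, sigma^2) prediction repeated for every query."""
    p = np.exp(-0.5 * (centers / sigma) ** 2)
    return np.tile(p / p.sum(), (n, 1))

err_matched = quantile_calibration_error(pit(binned_normal(1.0), bin_edges, y_true))
err_overconfident = quantile_calibration_error(pit(binned_normal(0.5), bin_edges, y_true))
print(err_matched, err_overconfident)
```

A predictor that matches the data-generating distribution scores close to zero up to sampling noise, while the too-narrow predictor is penalized at nearly every level.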

3. The head is task-agnostic in the loss

A nice side-effect: the regression head and the classification head are now the same operation — softmax over a discrete output space. The only difference is the output dimensionality and the bin-edge metadata. The optimizer sees one consistent objective, the codebase has one shared training loop, and the same calibration tooling works for both.
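To make the shared objective concrete, here is a minimal NumPy sketch in which one `cross_entropy` function serves both heads. The logits are placeholders rather than model outputs, the bin edges are uniform for brevity (the post uses equi-frequency edges), and all names are illustrative.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of integer labels under softmax(logits)."""
    z = logits - logits.max(axis=-1, keepdims=True)          # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)

# Classification head: 3 classes, labels are the classes themselves.
cls_logits = rng.standard_normal((8, 3))
cls_labels = rng.integers(0, 3, size=8)

# Regression head: 1024 bins, labels are the bin indices of the targets.
bin_edges = np.linspace(-4.0, 4.0, 1025)                     # placeholder edges
y = rng.standard_normal(8)
reg_labels = np.clip(np.searchsorted(bin_edges, y, side="right") - 1, 0, 1023)
reg_logits = np.zeros((8, 1024))                             # untrained: uniform over bins

cls_loss = cross_entropy(cls_logits, cls_labels)
reg_loss = cross_entropy(reg_logits, reg_labels)             # equals log(1024) here
```

The only regression-specific step is the lookup from target to bin index; everything after that is the classification code path.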

Three things you can do with a predictive distribution

Once you have p(y | x, context) instead of ŷ, three downstream patterns become available that simply aren't possible with a point estimate:

Risk-aware decisions. Compute E_p[u(y)] under an arbitrary utility function u. For a loan-default prediction, you don't want the mean default probability — you want the expected loss given your bank's actual loss function, which is asymmetric.
Active sampling. Pick the queries with the highest predictive entropy. Label them. Add them to the context. This is uncertainty sampling, a standard active-learning strategy, and it needs no special infrastructure.
Hallucination detection inside agents. When an LLM calls PredictLM as a tool, the agent can read the predictive entropy. High entropy on a query the LLM is confident about is a signal that the model and the agent disagree — and the agent should ask for human review.
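The first two patterns reduce to short array expressions once the bin probabilities are available. The sketch below is illustrative only: the asymmetric loss with a 5:1 penalty ratio, the helper names, and the synthetic distributions are all assumptions, and the expectation places each bin's mass at its center.

```python
import numpy as np

def expected_loss(probs, bin_centers, loss_fn, action):
    """E_p[loss(y, action)], approximating each bin by its center."""
    return (probs * loss_fn(bin_centers, action)).sum(axis=-1)

def asymmetric_loss(y, action, under=5.0, over=1.0):
    """Under-predicting costs `under` per unit, over-predicting costs `over`."""
    return np.where(y > action, under * (y - action), over * (action - y))

def entropy(probs):
    """Shannon entropy of the bin probabilities, in nats."""
    return -(probs * np.log(np.clip(probs, 1e-12, 1.0))).sum(axis=-1)

bin_edges = np.linspace(0.0, 1.0, 101)
centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
peaked = np.exp(-0.5 * ((centers - 0.5) / 0.1) ** 2)
peaked /= peaked.sum()
flat = np.full_like(centers, 1.0 / len(centers))
probs = np.stack([peaked, flat])                 # two queries: confident and unsure

# Risk-aware decision for query 0: search candidate actions for the lowest cost.
mean = (probs[0] * centers).sum()
costs = np.array([expected_loss(probs[0], centers, asymmetric_loss, a) for a in centers])
best_action = centers[costs.argmin()]            # sits above the mean

# Active sampling: label the highest-entropy queries first.
ranking = np.argsort(-entropy(probs))
print(mean, best_action, ranking)
```

Because under-prediction is penalized more heavily, the cost-minimizing action lands above the predictive mean, which a point estimate could not tell you. With equi-frequency bins, the entropy of the bin probabilities measures how far the prediction has moved from the reference marginal rather than the spread in target units, so it is best used for ranking queries against each other.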

The cost

A 1024-dimensional softmax has overhead — both training memory (the cross-entropy loss matrix) and inference latency (sampling and quantile extraction). In our setup the latency cost is ~6% over a Gaussian head. Training memory is ~12% higher.

If you only ever consume the predictive mean, you're paying for capability you don't use. But the entire reason calibrated uncertainty is interesting is the agent-native use case, where someone will read the full distribution. The cost is worth it.

What we use it for, internally

Every PredictLM evaluation reports both point-estimate metrics (R², accuracy) and distributional metrics (CRPS for regression, ECE for classification). When we ship a new architecture, we look at both. A model that improves R² while worsening CRPS is overfitting the mean and we'd rather not ship it.
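For readers who want to compute CRPS from a binned prediction, a minimal sketch follows. It uses the identity CRPS = E|X - y| - 0.5 E|X - X'| and treats each bin as a point mass at its center, which ignores within-bin spread; `crps_binned` is a hypothetical helper, not the evaluation code used for PredictLM.

```python
import numpy as np

def crps_binned(probs, bin_centers, y):
    """CRPS per query for a discrete distribution with mass at bin centers.

    probs: (n_queries, n_bins). bin_centers: (n_bins,). y: (n_queries,).
    Lower is better; 0 means all mass sits exactly on the true target.
    """
    # E|X - y|: expected distance from the prediction to the true target.
    term1 = (probs * np.abs(bin_centers[None, :] - y[:, None])).sum(axis=-1)
    # 0.5 * E|X - X'|: spread of the prediction itself.
    pairwise = np.abs(bin_centers[:, None] - bin_centers[None, :])
    term2 = 0.5 * ((probs @ pairwise) * probs).sum(axis=-1)
    return term1 - term2

bin_centers = np.array([0.0, 1.0])                # toy two-bin example
probs = np.array([[0.5, 0.5],                     # hedged prediction
                  [1.0, 0.0],                     # confident and right
                  [0.0, 1.0]])                    # confident and wrong
y = np.array([0.0, 0.0, 0.0])
scores = crps_binned(probs, bin_centers, y)
print(scores)
```

The confident wrong prediction scores worst and the hedged one lands in between, which is the behavior that makes CRPS a useful companion to R².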
