The Loss Function That Doesn't Panic
One bad reading – an extensometer slip, a sensor spike, someone nudging the rig – and your entire L2 fit chases it. Elastic modulus wrong, yield point shifted, whole model garbage. The root cause is that L2 squares the error, so a 70 MPa outlier gets 100x more weight than a clean 7 MPa reading. It doesn’t matter whether you’re fitting tensile data, concrete strength, or fatigue curves: outliers are not rare edge cases, they are Tuesday.
Huber loss is the fix nobody taught you in school. It has been around since 1964, ships in every major framework, and needs exactly one extra number from you.
See It First
Same dataset. Same trend. Three chaos agents disguised as data points. Drag them around or hit Animate and watch L2 (red) absolutely lose its mind while Huber (green) stays calm.
The bar chart shows the effective weight each loss assigns to every point. Three bars per point: red (L2), blue (L1), green (Huber). Full height means fully trusted, near zero means ignored. L2 trusts everything equally – bad readings included. Huber crushes the outliers and leaves clean points at full weight. L1 sits in between. Lower delta to make Huber more aggressive.
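Those weights are easy to compute yourself. In the usual iteratively-reweighted view, L2 gives every point weight 1, L1 gives weight $1/|e|$, and Huber gives $\min(1, \delta/|e|)$. A minimal NumPy sketch (our own helper, not the demo’s source code; the demo presumably rescales these for display):

import numpy as np

def effective_weights(residuals, delta):
    # How much each point influences the fit under each loss
    a = np.maximum(np.abs(residuals), 1e-12)  # guard against division by zero
    w_l2 = np.ones_like(a)                    # L2: full trust, always
    w_l1 = 1.0 / a                            # L1: big residual, small weight
    w_huber = np.minimum(1.0, delta / a)      # Huber: full trust inside delta
    return w_l2, w_l1, w_huber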
Why L2 Fails
Let $e_i = y_i - \hat{y}_i$ be the residual for point $i$ (measured minus predicted). Square them:
\[E_{\text{L2}} = \sum_i e_i^2\]

Smooth, fast, closed-form solution – the universal default for good reason. The problem is the squaring: a 70 MPa residual contributes 4,900 to the total, a 700 MPa outlier contributes 490,000. There is no ceiling on how much a single bad point can dominate.
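To see the domination concretely, corrupt one point and compare ordinary least-squares fits (a minimal sketch; the numbers here are made up for illustration):

import numpy as np

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0            # clean linear data: slope 2, intercept 1
y_bad = y.copy()
y_bad[5] += 700.0            # one corrupted reading

print(np.polyfit(x, y, 1))      # recovers ~[2.0, 1.0]
print(np.polyfit(x, y_bad, 1))  # slope and intercept badly shifted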
Why L1 Is Better But Annoying
Use absolute values instead:
\[E_{\text{L1}} = \sum_i |e_i|\]

Outliers now grow linearly, not quadratically. But the absolute value function has a corner at zero – the derivative does not exist there – so standard gradient-based solvers break. You need specialized linear-programming methods instead.
Huber: The Best of Both
Switch behavior at a threshold delta:
\[H_\delta(e_i) = \begin{cases} \dfrac{1}{2}e_i^2 & |e_i| \leq \delta \\[8pt] \delta\,|e_i| - \dfrac{1}{2}\delta^2 & |e_i| > \delta \end{cases}\]

Inside delta: quadratic, identical to L2 – smooth and differentiable, standard optimizers work fine. Outside delta: linear, identical to L1 – outliers can’t blow up the total. The function is continuous with a continuous first derivative at the transition point $|e_i| = \delta$, so no special solver is needed.
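The piecewise definition is two lines of NumPy if you want to plot it yourself (a minimal sketch; huber here is our own helper, not a library call):

import numpy as np

def huber(e, delta):
    # Quadratic inside the threshold, linear outside
    a = np.abs(e)
    return np.where(a <= delta, 0.5 * e**2, delta * a - 0.5 * delta**2)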
The derivative is the real story:
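Differentiating the definition above piece by piece:

\[\frac{dH_\delta}{de_i} = \begin{cases} e_i & |e_i| \leq \delta \\[8pt] \delta\,\operatorname{sign}(e_i) & |e_i| > \delta \end{cases}\]

Inside the threshold the gradient is the residual itself; outside, it is clamped to $\pm\delta$ – no single point can pull on the fit harder than $\delta$.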
Bottom panel: L1 (blue) jumps from -1 to +1 at zero with no defined value in between – that corner breaks gradient solvers. L2 (red) grows without bound – ten times the residual means ten times the gradient, no ceiling. Huber (green) matches L2 near zero, then clamps to ±δ beyond the threshold. Drag the slider and watch the green line’s flat region widen or narrow.
How to Pick Delta
Delta is the boundary between “normal noise” and “suspicious outlier.” Set it to your expected clean measurement noise. Sensor noise around 15 MPa? Use delta = 15.
The standard data-driven approach: run a quick L2 fit first, compute the residual standard deviation $\hat{\sigma}$, then set $\delta = 1.345\,\hat{\sigma}$. This gives 95% asymptotic efficiency on clean Gaussian data – meaning you lose almost nothing compared to L2 when data is clean – while staying robust when it isn’t. Several libraries build this in as the default (scikit-learn’s HuberRegressor, for example, defaults to epsilon = 1.35).
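A minimal sketch of that recipe in plain NumPy (pick_delta is our own helper; a median-absolute-deviation scale estimate is the more robust choice if the outliers are gross, but the residual standard deviation matches the recipe above):

import numpy as np

def pick_delta(x, y):
    # Quick L2 fit, then delta = 1.345 * residual standard deviation
    A = np.column_stack([np.ones_like(x), x])
    coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coeffs
    return 1.345 * residuals.std()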
In Python (NumPy only)
import numpy as np

def huber_fit(x, y, delta=1.0, iters=50):
    # Iteratively reweighted least squares (IRLS) for a Huber line fit
    # Start from ordinary least squares
    A = np.column_stack([np.ones_like(x), x])
    coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    for _ in range(iters):
        residuals = y - A @ coeffs
        # Weight = 1 for small errors, delta/|e| for large ones
        # (floor |e| to avoid division by zero on perfect fits)
        abs_res = np.maximum(np.abs(residuals), 1e-12)
        weights = np.where(abs_res <= delta, 1.0, delta / abs_res)
        W = np.diag(weights)
        # Weighted least squares: (A^T W A)^{-1} A^T W y
        coeffs = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    return coeffs[0], coeffs[1]  # intercept, slope

x = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100], dtype=float)
y = np.array([68, 88, 102, 124, 141, 158, 178, 196, 214, 230], dtype=float)
y[3] = 400.0  # someone had a bad day

intercept, slope = huber_fit(x, y, delta=20.0)
print(f"Huber fit: y = {intercept:.1f} + {slope:.2f}x")
Honorable Mentions
When Huber still is not enough (sketches of the first two follow the list):
Tukey bisquare – beyond threshold $c$, loss becomes constant. Outliers contribute exactly zero gradient; the optimizer stops listening to them entirely. More powerful than Huber, but non-convex: requires a good initialization (e.g. an L2 or Huber pre-fit) or it can converge to a wrong solution.
Cauchy loss – grows logarithmically. Better than Huber when outliers are frequent rather than rare, such as heavy-tailed sensor noise.
Barron adaptive loss (2019) – a single parametric family that includes L2, Huber, Cauchy, and Welsch as special cases. The shape parameter $\alpha$ can be learned from data, so the model selects its own robustness level without manual delta tuning.
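The Tukey and Cauchy losses have compact closed forms; minimal NumPy sketches below, using the conventional 95%-efficiency tuning constants (illustrative helpers, not library calls):

import numpy as np

def tukey_bisquare(e, c=4.685):
    # Flat (= c^2/6) beyond |e| = c: outliers contribute zero gradient
    a = np.minimum(np.abs(e) / c, 1.0)
    return (c**2 / 6.0) * (1.0 - (1.0 - a**2)**3)

def cauchy_loss(e, c=2.385):
    # Logarithmic growth: frequent, heavy-tailed outliers pull only weakly
    return (c**2 / 2.0) * np.log1p((e / c)**2)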
Basem Rajjoub