Episode 42 – Laurens Bensdorp – Building Strategies with Purpose

When Back-Tests Fail

There’s a special place in trading graveyards reserved for the back-test that looked gorgeous on paper and then detonated in production. I’ve been there. If you trade long enough, you will too. We all know the over-fitting issues, and I’ll get into those, but there’s another reason why back-tests can fail: the initial purpose is not matched to the right method. If we ask the wrong thing of the test, measure it with the wrong yardstick, or ignore how strategies behave when they’re combined (i.e., neglect the portfolio impact), then failure is likely. Laurens Bensdorp has a major focus on this – nicely summarised as building strategies with purpose.

The punchline is simple: your objective should dictate everything—what you build, which metrics you care about, and how you size and combine the pieces. Or as Rich Brennan puts it succinctly in his fantastic recent appearance on Episode 361 of Top Traders Unplugged, “your objective dictates your design.”

1) Start with purpose, not with code

Most back-tests fail ex post because they were optimized for the wrong mission ex ante. Using trend following as an example, Rich breaks managers down into four archetypes that behave—and should be judged—very differently: replicators, core diversifiers, crisis risk offset managers, and outlier hunters. If you’re replicating the SG CTA index, tracking error probably matters more than headline Sharpe. If you’re a core diversifier inside a multi-asset book, focus on portfolio-level Sharpe and correlation. Crisis-risk offsets live or die on convexity during selloffs. Outlier hunters care about payoff asymmetry, skewness, and top-trade contribution, not whether the equity curve is smooth this quarter. The professor is bang-on as usual.

That framing changes design choices you’d otherwise debate forever on Twitter: breadth vs concentration; absolute momentum vs cross-sectional momentum; volatility targeting vs static small bets; symmetry vs asymmetry in rules. There is no universal “right.” There’s only right for your purpose.

For the retail trader, I think the temptation is to believe that the only objective is high returns with minimal drawdowns, aka smooth equity curves. But as Laurens brings out, once you have strategy 1, the purpose and reasoning behind strategy 2 will begin to be refined. Naturally, you’ll want something different. As we also brought out in our chat, the things you look for and the way you think about, measure and build one kind of strategy (like mean-reversion) will be completely different from how you’ll build another (like trend-following). Measuring them by the same yardstick, or assuming similar risks, simply won’t work. Trying to build a trend strategy with a 70% win rate means, by definition, you’ll cut off the positive outliers that are otherwise expected to do all the heavy lifting in this kind of strategy.
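To see why, here’s a toy simulation (my own illustrative numbers, nothing from the episode): each trade is a random walk, and the only design choice is how wide the profit target is. Tighten the target and the win rate soars while the expectancy collapses, because the fat-tailed winners get clipped.

```python
import random

random.seed(1)

def run_trade(target, stop=-1.0, max_steps=3000):
    """One trade as a random walk (in R units). 40% of trades are 'trend'
    trades with upward drift; the rest drift slightly down. Exit at the
    profit target or the stop, whichever comes first."""
    trending = random.random() < 0.40
    drift = 0.01 if trending else -0.001
    pnl = 0.0
    for _ in range(max_steps):
        pnl += drift + random.gauss(0, 0.05)
        if pnl >= target:
            return target
        if pnl <= stop:
            return stop
    return pnl

def summarize(target, n=2_000):
    results = [run_trade(target) for _ in range(n)]
    win_rate = sum(r > 0 for r in results) / n
    print(f"target {target:>5.1f}R: win rate {win_rate:.0%}, "
          f"expectancy {sum(results) / n:+.2f}R per trade")

for target in (0.3, 1.0, 5.0, 20.0):
    summarize(target)  # win rate falls as the target widens; expectancy rises
```

The exact numbers are irrelevant; the shape is the lesson: the “70% win rate” version and the “let it run” version are different strategies with different purposes.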

As an example, if you are hunting fat tails, cutting exposure just as volatility explodes “clips the wings of [your] biggest winners.” As Rich says, “I want to ride the wave in full,” which is why he normalizes at entry and then leaves positions alone. Or another great example he raised is that if your edge comes from catching rare, explosive moves, maximum feasible breadth isn’t decoration—“it’s a core operating principle.” The cost of missing the one market that goes parabolic far outweighs years of mediocrity elsewhere.
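Here’s a minimal sketch of that sizing idea, assuming an ATR-based normalization at entry (the helper and its numbers are mine, not Rich’s actual rules): size the position once so that a one-ATR move costs a fixed fraction of equity, then leave it alone.

```python
def atr_position_size(equity, risk_fraction, atr_at_entry, point_value):
    """Size so that a one-ATR move equals a fixed fraction of equity.
    Computed once at entry; the position is NOT resized afterwards,
    so a later volatility spike doesn't clip the winner."""
    dollar_risk_per_contract = atr_at_entry * point_value
    return int((equity * risk_fraction) / dollar_risk_per_contract)

# Illustrative: $1M account, 0.2% of equity per ATR of movement,
# crude oil with an entry ATR of 1.8 points at $1,000 per point.
contracts = atr_position_size(1_000_000, 0.002, 1.8, 1_000)
print(contracts)
```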

Practical takeaway: Before you sit down to code, write down (a) the role this strategy must play in the portfolio, and (b) the metrics that best capture success in that role. Then build to those metrics and stop optimizing once the design meets them.
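One way to hold yourself to that is to write the purpose down as a structured spec before any strategy code exists. A minimal illustration (the field names and example values are mine, not from the episode):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class StrategySpec:
    """Purpose-first spec, written down BEFORE the back-test is coded."""
    name: str
    archetype: str            # e.g. "outlier hunter", "core diversifier", "hedge"
    portfolio_role: str       # the structural hole this fills in the book
    primary_metrics: list = field(default_factory=list)
    acceptance_criteria: dict = field(default_factory=dict)

spec = StrategySpec(
    name="long_vol_overlay",
    archetype="crisis risk offset",
    portfolio_role="offset drawdowns in the long-only equity sleeve",
    primary_metrics=["portfolio MAR in equity drawdowns", "crisis convexity"],
    acceptance_criteria={"worst_5pct_day_contribution": "> 0"},
)
```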

2) Ugly equity curves can be beautiful (in the right portfolio)

One costly back-test bias is fetishizing pretty equity curves. Laurens Bensdorp makes the opposite case: be willing to add a strategy with an ugly equity curve or even a negative standalone return if it fills a structural hole in your book (like hedging catastrophic failures). Most traders won’t do it precisely because it looks ugly—and that’s your edge.

Think hedging, long-vol, short-equity overlays—deliberately lossy in the median state of the world, but explosive when you need it. In his discussion with me on the show, Laurens points out that a long-vol component can “dramatically improve the overall result of your portfolio” when the long-only equity sleeve gets hit in lockstep. He even sketches a concrete long-vol rule using something like VXX with a breakout filter (e.g., a Keltner/ATR trigger).
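Since Laurens only sketches the idea on the show, the following is my own illustrative reading of it, not his actual rules: go long VXX when price closes above a Keltner-style band (EMA plus a multiple of ATR), and go flat once it closes back below the midline.

```python
import numpy as np
import pandas as pd

def keltner_breakout_signal(px: pd.DataFrame, span: int = 20,
                            atr_mult: float = 2.0) -> pd.Series:
    """1 = long VXX, 0 = flat. Long on a close above EMA + atr_mult * ATR;
    flat again once price closes back below the EMA midline.
    px needs 'high', 'low', 'close' columns."""
    ema = px["close"].ewm(span=span, adjust=False).mean()
    prev_close = px["close"].shift()
    true_range = pd.concat([px["high"] - px["low"],
                            (px["high"] - prev_close).abs(),
                            (px["low"] - prev_close).abs()], axis=1).max(axis=1)
    atr = true_range.ewm(span=span, adjust=False).mean()
    upper_band = ema + atr_mult * atr

    signal = pd.Series(np.nan, index=px.index)
    signal[px["close"] > upper_band] = 1.0   # breakout: long vol
    signal[px["close"] < ema] = 0.0          # back below midline: flat
    return signal.ffill().fillna(0.0)        # hold state between events
```

Run standalone, a rule like this will probably bleed. That’s the point: you’re buying convexity for the bad days.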

If your objective includes insurance, then the metric shifts. Instead of “highest CAGR,” think “portfolio MAR during equity drawdowns,” “crisis convexity,” or “time to recovery.” Judge the hedge by its portfolio contribution on your worst 5% of days, not by its solo back-test. You’ll stop throwing away “ugly” curves that are doing exactly what you hired them to do. Don’t fear – the excitement is still there: in your overall portfolio result, compounding your equity more efficiently, especially during chaos when others are suffering. Be wary of discarding a strategy just because it looked ugly on the first run of your back-test. Don’t be a gambler.

Practical check: In every strategy review, include a sheet titled “Why this looks bad alone but strong together” and show: (1) correlation to the core sleeve in stress windows, (2) contribution to worst-day P&L, (3) improvement in peak-to-trough and time-to-recovery at the portfolio level.
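A minimal sketch of those three checks, assuming you have aligned daily return series for the core sleeve and the candidate hedge (the 80/20 blend is an arbitrary illustration):

```python
import pandas as pd

def stress_review(core: pd.Series, candidate: pd.Series) -> dict:
    """Judge the candidate by what it does on the core sleeve's worst days."""
    stress = core <= core.quantile(0.05)      # worst 5% of core days
    combined = 0.8 * core + 0.2 * candidate   # illustrative 80/20 blend

    def time_to_recover(returns):
        """Longest stretch of days spent below the prior equity peak."""
        equity = (1 + returns).cumprod()
        underwater = (equity < equity.cummax()).astype(int)
        runs = underwater.groupby((underwater == 0).cumsum()).cumsum()
        return int(runs.max())

    return {
        "stress_correlation": candidate[stress].corr(core[stress]),
        "worst_day_contribution": candidate[stress].mean(),
        "recovery_days_core_alone": time_to_recover(core),
        "recovery_days_with_hedge": time_to_recover(combined),
    }
```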

3) Parrondo’s paradox: when two losers make a winner

In game-theory terms, Parrondo’s paradox describes how “a combination of losing strategies can become a winning strategy.” (Wikipedia) The finance literature shows that the mechanism can be mundane: diversification and switching rules can create favorable drift when payoffs are path-dependent. (Michael Stutzer’s work is a good starting point if you want a gentle, finance-flavored derivation.) (Leeds Faculty)

This isn’t magic; it’s interaction. The visual in the video interview was awesome: two descending “staircases” (two losing processes). Combine them, and the ball rises—the composite path drifts up because the patterns of wins and losses interleave favorably. That’s Parrondo in a nutshell, and it’s why we should obsess over how we combine strategies, not just over each strategy in isolation.
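If you want to see the staircase effect yourself, here’s the canonical coin-flip construction of the paradox (the textbook version, not a trading system): game A is a coin with a small house edge; game B’s odds depend on whether your capital is a multiple of three. Each loses on its own, yet randomly alternating between them wins.

```python
import random

random.seed(0)
EPS = 0.005  # the small house edge that makes each game losing on its own

def game_a(capital):
    """Slightly unfair coin flip."""
    return 1 if random.random() < 0.5 - EPS else -1

def game_b(capital):
    """Path-dependent odds: a bad coin when capital % 3 == 0, a good one otherwise."""
    p = (0.10 - EPS) if capital % 3 == 0 else (0.75 - EPS)
    return 1 if random.random() < p else -1

def play(strategy, rounds=10_000, trials=200):
    """Average final capital over many independent runs."""
    total = 0
    for _ in range(trials):
        capital = 0
        for _ in range(rounds):
            capital += strategy(capital)
        total += capital
    return total / trials

print("A alone:   ", play(game_a))   # should come out negative
print("B alone:   ", play(game_b))   # should come out negative
print("Random mix:", play(lambda c: random.choice((game_a, game_b))(c)))  # positive
```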

Where do traders go wrong?

  • They select only “individually great” systems, unknowingly correlating the failure modes.
  • They allocate capital by recent Sharpe, which often synchronizes the book of strategies—great right up to the regime break. Heed the warning of LTCM.
  • They never test allocation heuristics (equal risk, individual stock exposure caps, dynamic strategy allocation, static small bets, etc.) as design choices in their own right (a quick comparison follows below).
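On that last point, allocation rules deserve the same testing discipline as entries and exits. A toy comparison on synthetic returns (all numbers invented; the long sample is purely to keep estimation noise down) of equal-capital versus naive equal-risk weighting:

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented daily returns for three sleeves with very different volatilities
n_days = 50_000
returns = np.column_stack([
    rng.normal(0.0004, 0.020, n_days),  # high-vol trend sleeve
    rng.normal(0.0002, 0.005, n_days),  # low-vol mean-reversion sleeve
    rng.normal(0.0003, 0.010, n_days),  # mid-vol sleeve
])

def annual_sharpe(weights):
    port = returns @ weights
    return port.mean() / port.std() * np.sqrt(252)

equal_capital = np.ones(3) / 3
inv_vol = 1 / returns.std(axis=0)
equal_risk = inv_vol / inv_vol.sum()   # naive equal risk: inverse-vol weights

print(f"equal capital: Sharpe {annual_sharpe(equal_capital):.2f}")
print(f"equal risk:    Sharpe {annual_sharpe(equal_risk):.2f}")
```

With these invented numbers the equal-risk weights score better because the low-vol sleeve is no longer drowned out by the high-vol one; the point is that the weighting rule itself moved the result, with no change to any strategy.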

4) Beware the seduction of pretty curves (over-fitting 101)

There’s an art to applying the science, Laurens mentioned. A strict set of rules for how to build and robustness-test a strategy is great; however, different strategies call for different principles and methods. Without understanding this you can end up overfitting to the noise rather than extracting an enduring signal.

Three practical anti-overfit policies could include:

  1. Design-first, then test. Write down the mechanism you expect to harvest (trend persistence, mean reversion after liquidity shocks, term-structure convexity, etc.) before you see results. The design-first logic specifically helps to avoid the data-mining trap. Think about the trade-by-trade results (not just the overall portfolio) and consider how important trade sequence is to your particular model. In essence, do more analysis and press the optimize button less.
  2. Simplicity beats cleverness (multiplied). Laurens calls it “multiplying simplicity”: build simple strategies with few parameters that are fundamentally different from each other, and let the combination do the “complex” work of smoothing the path.
  3. Use the right robustness tools for the right style. Methods that reduce overfitting in convergent models (e.g., some Monte Carlo treatments) might not transfer cleanly to trend following (divergent models), where outliers are irregular and non-repeating. That’s one reason they argue trend followers could even use the full dataset judiciously instead of carving out conventional out-of-sample windows—because you don’t want to throw away your scarce tail events. (You can disagree with the prescription; the point is to align the validation method with the strategy’s signal ecology.)
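To make the perturbation idea concrete, here’s a small sketch that stresses a finished back-test’s trade list (the trade numbers, slippage and skip rate are assumptions for illustration):

```python
import random

def perturb(trades_r, slippage_r=0.05, skip_prob=0.10, n_runs=1000, seed=3):
    """Re-run a back-test's trade list under stress: shave slippage off every
    trade and randomly skip a fraction of signals. For an outlier hunter the
    skipped-signal test is the scary one: missing the fat tail dominates."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_runs):
        kept = [r - slippage_r for r in trades_r if rng.random() > skip_prob]
        totals.append(sum(kept))
    totals.sort()
    return {"median_R": totals[len(totals) // 2], "worst_R": totals[0]}

# Hypothetical trend-following trade list (R-multiples): one outlier pays for all
trades = [18.0, -1.0, -0.9, -1.1, 4.0, -1.0, -0.8, -1.0, 2.5, -1.0]
print(perturb(trades))
```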

Once the purpose is set, there are still universal statistical red flags to watch for:

  • Too many degrees of freedom per trade. If your parameter count rivals your annual trade count, you’re fitting noise!
  • Narrow regime dependence. All the edge comes from two crisis months? Assume it won’t repeat the same way.
  • No cross-market validity. If it only works on one contract but dies on close substitutes, it isn’t a robust process—it’s historical happenstance.
  • Optimization without budget. If you don’t cap how many knobs you’re allowed to tune (or how much improvement you’re willing to believe), you will inadvertently optimize on luck.
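The first of those red flags is easy to automate. A crude sanity check (the 30-trades-per-parameter threshold is my rule of thumb, not a statistical guarantee):

```python
def parameter_budget_ok(n_parameters: int, n_trades: int,
                        min_trades_per_param: int = 30) -> bool:
    """Demand a healthy number of trades per tunable knob before trusting
    a back-test. If the ratio is thin, you're probably fitting noise."""
    return n_trades / max(n_parameters, 1) >= min_trades_per_param

print(parameter_budget_ok(n_parameters=3, n_trades=450))   # True
print(parameter_budget_ok(n_parameters=12, n_trades=90))   # False: fitting noise
```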

Another Trend Following Example from the Professor

In that TTU episode (361) Rich ran an interesting experiment: equal-weight a 68-market outlier-hunting portfolio with no hindsight and you get a respectable MAR ≈ 0.8 despite 27 unprofitable markets in-sample. Now randomly pick 300 different 30-market subsets from the same universe: only ~12% match or beat that MAR; median outcomes fall well below 0.5, with many survivability-threatening results at the lower quartile—purely because the 30 markets selected were not as lucky. 68 markets means more chance of catching the big trend, if that’s your objective.

So if you’re an outlier hunter: the very thing that makes your edge (fat-tail capture) also makes your realized performance highly path-dependent in finite samples. The antidote is obvious but operationally hard: more hooks in the water (trade more markets)—even if some markets look “quirky,” even if many contribute nothing for years—because one fat tail pays for their keep.
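You can run the same kind of experiment on your own universe. The sketch below uses synthetic returns built purely to mimic the shape of Rich’s result (rare fat-tailed payoffs scattered across 68 markets), then measures how often a random 30-market subset keeps up with the full set:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for 68 markets of daily strategy returns: mostly noise,
# with rare fat-tailed trend payoffs scattered across days and markets.
n_days, n_markets = 5000, 68
returns = rng.normal(0.0, 0.01, (n_days, n_markets))
tail_days = rng.random((n_days, n_markets)) < 0.001
returns[tail_days] += rng.exponential(0.15, tail_days.sum())

def mar(market_idx):
    """CAGR / max drawdown for an equal-weight portfolio of chosen markets."""
    port = returns[:, market_idx].mean(axis=1)
    equity = np.cumprod(1 + port)
    cagr = equity[-1] ** (252 / n_days) - 1
    max_dd = (1 - equity / np.maximum.accumulate(equity)).max()
    return cagr / max_dd

full_universe = mar(np.arange(n_markets))
subset_mars = [mar(rng.choice(n_markets, size=30, replace=False))
               for _ in range(300)]
share_matching = np.mean([m >= full_universe for m in subset_mars])
print(f"68-market MAR: {full_universe:.2f}")
print(f"30-market subsets matching it: {share_matching:.0%}")
```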

The Sharpe ratio isn’t going to capture this risk and tell you “Sorry, you didn’t trade enough markets”, so Sharpe isn’t the thing you should be concerned about in this scenario. Your objective function should be married to your actual objectives; in this case, be open to opportunities wherever they may occur and hang on for dear life when they happen. That’s going to preclude a fixation on “minimizing drawdown”. Starting to make sense?

Practical implications:

Don’t compare all strategies with the same metrics. Individual equity curves aren’t for ego trips; focus on the portfolio. Consider the archetype of your strategy and build, measure and test appropriately. To quote Rich again: lining up a bus, a sports car, and a motorbike, then ranking them on racetrack lap time, is misleading. Ask “right for what objective?” first.

5) Building a purpose-first validation workflow

Here’s a possible checklist you could use to build more trust in your back-tests:

  1. Clarify objective & metrics
  • Strategy type / archetype: What are we going for, and what does that imply for trade count, commission drag, the markets it should run on, where profits will come from, and which risks are present?
  • Primary metrics: If it’s a hedge strategy, the obvious question is how it performed in past crises. If it’s a trend strategy, we might be looking at skew, top-decile trade contribution, time under water, etc. If it’s a reversion strategy, I’m interested in individual trade payoff, win rate, clustering of trades, concentration risk, exposure during black-swan events, etc. Align metrics to your mission.
  2. Design first, then code
  • Document the economic/structural/behavioral logic the rule exploits.
  • List assumptions that must hold (microstructure, liquidity, regime features).
  • Specify when you expect the strategy to make money, and when it shouldn’t, before you see results. You shouldn’t be optimising parameters for their own sake (max return, say), but to identify vulnerabilities and make design choices that address those potholes.
  3. Right-sized simplicity
  • Enforce a parameter budget (e.g., ≤3 tunables per entry/exit complex). Simplicity scales.
  • Favor orthogonal simple rules over one complex rule (“multiply simplicity,” as Laurens says). “If you do that in a logical way where you understand what the weakness is… you increase the risk adjusted return,” he said. In other words, embrace simple models but combine them thoughtfully.
  • If you are analysing the data for signals, rather than pushing it to deliver a ready-made strategy in a few steps, you will naturally be less prone to over-fitting because you are taking a ‘research-first’ approach. Good strategies exhibit plateaus in parameter space – broad regions where performance is relatively stable (a plateau-scan sketch appears after this checklist). The key is humility. A back-test is a model of the past, not a prediction.
  4. Robustness testing matched to strategy
  • For trend-following, where signals are sparse and outliers irregular, prefer multi-market, multi-decade testing and perturbations (slippage shocks, delayed entries, skipped signals) over carving away scarce tails for OOS. The OOS window simply won’t be long enough to measure success.
  • For convergent or shorter-term styles, walk-forward, out-of-sample and alternative-market tests might be a lot more useful. Monte Carlo tests that shuffle trade results are likely to make things look safer than they are, because the real risk for reversion traders is the synchronisation of trades during tail events! What’s the ‘crisis score’?
  • Expanding breadth (number of markets, strategies, non-correlated payoffs) is always beneficial. An ounce more breadth far outweighs individual strategy tweaks. Do you have a metric for ‘breadth health’?
  • Pre-commit to acceptance criteria. Ensure logic reigns in parameter selection. Forget the best outcome; remember the worst. Be statistically minded, but know where your objectives will or won’t be captured by various metrics. Over-fitting can masquerade as statistical significance; don’t rely on the code alone, pressure-test the logic.
  5. Portfolio-aware evaluation
  • Run strategy-in-portfolio simulations against your existing stack, measuring the role it’s meant to play (drawdown relief, convexity, skew). Does it add to the portfolio overall?
  • Building orthogonal strategies, harvesting diversification, taking steps to better utilise and allocate capital are all free lunches that can compound returns without requiring you to be a better trader.
  6. Luck accounting
  • For trend traders, use top-trade concentration and luck-adjusted ranges to show yourself what happens if you miss the one fat tail (and with less breadth, the odds rise that you will); a sketch follows after this checklist.
  • For reversion traders, force tests to include the disastrous events that could have wiped the strategy out. Try to break it so that you’re not blind to the risks.
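Here’s the luck-accounting sketch promised above: strip out the top trades one at a time and watch what’s left (the trade list is hypothetical):

```python
def without_top_trades(trades_r, k):
    """Recompute total R after removing the k best trades: a quick gauge
    of how much of the edge lives in a handful of outliers."""
    kept = sorted(trades_r)[:-k] if k else list(trades_r)
    return sum(kept)

trades = [14.2, -1.0, -0.8, 6.3, -1.0, 2.1, -0.9, -1.0, 9.8, -1.1]
for k in range(4):
    print(f"drop top {k}: total {without_top_trades(trades, k):+.1f}R")
```

For an outlier hunter, watching the total flip from comfortably positive to negative after removing two or three trades is normal; the question is whether your breadth gives you enough shots at those trades.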

The point of the back-test is not to maximise an in-sample Sharpe; it is to reject fragile hypotheses. A robust research process deliberately tries to falsify its own ideas by exposing them to alternative samples, parameter perturbations and different market universes. Smooth curves are alluring; survivability is the edge.
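And one concrete way to act on the parameter-plateau idea from the checklist: score each parameter setting by its neighbourhood, not by its own back-test number, so an isolated spike loses to a broad, stable region (the grid and Sharpe values below are invented):

```python
def plateau_score(results: dict, key, radius):
    """Average performance in a neighbourhood of `key`, so an isolated
    spike scores worse than a broad, stable region."""
    neighbours = [v for k, v in results.items() if abs(k - key) <= radius]
    return sum(neighbours) / len(neighbours)

# results maps a lookback parameter to its back-tested Sharpe (invented)
results = {10: 0.4, 20: 0.9, 30: 1.0, 40: 0.95, 50: 0.9, 60: 0.3, 70: 1.6, 80: 0.2}
best_raw = max(results, key=results.get)  # 70: a lucky, isolated spike
best_plateau = max(results, key=lambda k: plateau_score(results, k, radius=10))
print(best_raw, best_plateau)  # the plateau pick sits in the stable 20-50 region
```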

Closing: Portfolio construction beats single-system cleverness

Back-tests fail because we treat them like verdicts rather than experiments. Be a better scientist. The remedy is to embed purpose at the centre of your design, judge sleeves by the role they play in the portfolio, and respect interaction effects that might even turn two losers into a winner! Laurens’s advice here is gold—embrace the hedge that looks bad alone if it rescues the whole when it matters.

And above all, keep Rich’s mantra close: objective → design → metrics. When you get that chain right, back-tests are just a research tool. The goal is to generate profits, which will happen if you let the back-test speak, rather than torture it till it tells you what you want to hear.

We traders love to search for recurring patterns, but markets are complex adaptive systems influenced by countless variables. Patterns can rhyme, but they rarely repeat exactly. When it comes to statistical significance, breadth is your friend. Running a broad range of strategies and parameter settings at least places you in a position where some of those will do great next year. That’s an objective that has statistical and logical validity and will yield better results over time than gambling on the one or two that worked best last year. Treat every result as a draw from a distribution, not a prophecy, and you’ll set the mind into the right place for proper strategy development.

Be purposeful about your diversification.

Survive first, thrive later.

Hope that helps!

Simon

 

Get in Touch with Laurens

Website

X

LinkedIn