Verification Depth — How Long a Track Record Has to Be to Mean Something
30 days of trading is statistical noise. Here is the math behind sample-size requirements and how NakedPnL's Bronze, Silver, and Gold tiers are calibrated.
- A 30-day track record cannot statistically distinguish a skilled trader from a lucky one with any meaningful confidence.
- Lo (2002) shows the standard error of an annualized Sharpe ratio scales as 1/sqrt(T) — short windows have wide confidence intervals.
- Even a very strong Sharpe (2+) needs a year or more of clean daily data to become detectable above noise; weaker Sharpes need many years.
- NakedPnL's Bronze/Silver/Gold tiers map to verified days, linked venues, and continuity — not to performance.
- The depth tier is descriptive of evidence weight, never of trader quality. A Gold tier with poor returns is still poor returns.
Length of track record is the single most undervalued piece of context in published trading performance. A trader with three months of impressive returns and a trader with three years of steadier returns are routinely placed side by side in the same ranking, as if the numbers were equally meaningful. They are not. Because the standard error shrinks only as 1/sqrt(T), the confidence interval around the three-month trader's return is roughly 3.5 times wider than the three-year trader's (sqrt(36/3) ~= 3.5), which usually means the published number is consistent with anything from genuine skill to pure luck.
This article explains why sample size matters in performance measurement, what the academic literature says about minimum track-record duration, and how NakedPnL's verification depth tiers (Bronze, Silver, Gold) operationalize those requirements.
Why short tracks lie
Trading returns are noisy. Even a trader with zero edge will produce winning weeks and losing weeks at random; over short horizons, the noise dwarfs any underlying skill signal. The statistical question is whether the observed performance is large enough relative to the noise to credibly demonstrate that the trader has skill, as opposed to having been on the right side of a coin flip.
The standard tool for this is the t-statistic of the Sharpe ratio. Lo (2002), in 'The Statistics of Sharpe Ratios', derived the asymptotic distribution of the sample Sharpe and showed that for a true Sharpe S over T years of measurement (with daily observations), the standard error of the estimate is approximately:
SE(Sharpe_hat) ~= sqrt((1 + 0.5 * S^2) / T)
where T is the number of years observed and S is the (unknown) true Sharpe.
The implication is that detecting a true Sharpe of S above zero with 95% confidence requires a track record long enough that 1.96 × SE(Sharpe_hat) is less than S. Solving the inequality for T gives a minimum sample size:
T_min ~= 1.96^2 * (1 + 0.5 * S^2) / S^2
For S = 0.5: T_min ~= 17.3 years
For S = 1.0: T_min ~= 5.8 years
For S = 1.5: T_min ~= 3.6 years
For S = 2.0: T_min ~= 2.9 years
For S = 3.0: T_min ~= 2.3 years
The headline number: detecting an extraordinary Sharpe of 2, the kind of risk-adjusted performance that makes headlines among institutional allocators, takes about three years of clean daily data to confirm at standard significance levels. Detecting a more typical good-but-not-spectacular Sharpe of 1 takes nearly six years. A 30-day window is far too short to confirm any Sharpe ratio that occurs in real trading data.
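The arithmetic behind the minimum-length expression can be sketched in a few lines of Python. This is only an illustration of the approximation as written above, not a full treatment of Lo's results; the function names `sharpe_standard_error` and `min_years_to_detect` are hypothetical, and the 252 trading days per year used for the day count is an assumption.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value, as used in the text


def sharpe_standard_error(true_sharpe: float, years: float) -> float:
    """Approximate standard error of the annualized Sharpe estimate."""
    if years <= 0:
        raise ValueError("years must be positive")
    return math.sqrt((1.0 + 0.5 * true_sharpe**2) / years)


def min_years_to_detect(true_sharpe: float, z: float = Z_95) -> float:
    """Smallest T (in years) at which z * SE equals the true Sharpe."""
    if true_sharpe <= 0:
        raise ValueError("true_sharpe must be positive")
    return z**2 * (1.0 + 0.5 * true_sharpe**2) / true_sharpe**2


if __name__ == "__main__":
    for s in (0.5, 1.0, 1.5, 2.0, 3.0):
        years = min_years_to_detect(s)
        # 252 trading days per year is an assumed convention.
        print(f"S = {s:.1f}: T_min ~= {years:.1f} years (~{years * 252:.0f} trading days)")
```

Passing `z=2.576` gives the corresponding requirement at p < 0.01 under the same approximation.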
The trade-off between length and recency
Longer is better statistically, but longer also means older, and trading regimes change. The 2017 ICO bull market, the 2018 winter, the 2020 pandemic-driven volatility, the 2021 leverage cascade, and the 2022 Luna implosion all produced very different return distributions. A 10-year Sharpe can be a precise estimate of an average that no longer reflects current conditions.
There is no universal answer to the trade-off. The mainstream institutional convention is that 3–5 years of clean monthly data is the minimum viable track record for a quantitative manager, with daily data offering meaningfully better statistical power per calendar year. For crypto specifically, the rapid regime shifts mean that even 5-year records often span structurally different markets, so the within-regime sample is often smaller than the calendar duration suggests.
Sample size requirements by Sharpe ratio
| True Sharpe (S) | Years for p < 0.05 | Years for p < 0.01 | Trading days at 252/yr (p < 0.05) |
|---|---|---|---|
| 0.25 | 63 | 109 | 16,000 |
| 0.50 | 17 | 30 | 4,360 |
| 0.75 | 8.8 | 15 | 2,210 |
| 1.00 | 5.8 | 10 | 1,450 |
| 1.50 | 3.6 | 6.3 | 910 |
| 2.00 | 2.9 | 5.0 | 730 |
| 2.50 | 2.5 | 4.4 | 640 |
| 3.00 | 2.3 | 4.1 | 590 |
The table illustrates two important properties. First, the cost of detecting modest skill (Sharpe < 1) is enormous in human-relevant timeframes: a Sharpe of 0.5 needs the better part of two decades, and a Sharpe of 0.25 needs longer than a working career. Second, even very strong Sharpes (above 2) require at minimum a full year of clean daily data, which is far longer than the typical influencer track record on display.
What track record length does and does not prove
Length of record is necessary but not sufficient. A 5-year clean track record gives the statistical power to distinguish skill from noise; it does not by itself prove the trader is skilled. The same long sample could in principle have been produced by a long run of luck, by survivorship from a much larger cohort, or by a strategy that worked in the historical regime but is currently broken.
What length proves is evidentiary weight: a longer sample means the published Sharpe and TWR have narrower confidence intervals. It is a multiplier on whatever signal exists, not a substitute for the signal. A trader with 5 years of mediocre performance is mediocre with high confidence. A trader with 5 years of strong performance is strong with high confidence. A trader with 30 days of strong performance is unknown.
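To make the point about narrower confidence intervals concrete, the sketch below computes an approximate 95% interval around an observed Sharpe for three record lengths. It reuses the standard-error approximation from earlier in the article and plugs the observed Sharpe in for the unknown true value, which is a simplifying assumption; the function name `sharpe_confidence_interval` is hypothetical.

```python
import math


def sharpe_confidence_interval(observed_sharpe: float, years: float, z: float = 1.96):
    """Return approximate (low, high) bounds for the annualized Sharpe.

    Assumption: the observed Sharpe stands in for the true Sharpe inside
    the standard-error expression.
    """
    if years <= 0:
        raise ValueError("years must be positive")
    se = math.sqrt((1.0 + 0.5 * observed_sharpe**2) / years)
    return observed_sharpe - z * se, observed_sharpe + z * se


if __name__ == "__main__":
    # Same observed Sharpe, three different record lengths.
    for label, years in (("30 days", 30 / 365), ("1 year", 1.0), ("5 years", 5.0)):
        low, high = sharpe_confidence_interval(1.5, years)
        print(f"{label:>8}: observed 1.5, 95% interval [{low:.2f}, {high:.2f}]")
```

In this sketch an observed Sharpe of 1.5 has an interval that excludes zero only at the 5-year length; at 30 days and at 1 year the interval still spans zero.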
How NakedPnL's depth tiers are calibrated
NakedPnL's verification depth (lib/verification-depth.ts) computes a tier — Bronze, Silver, or Gold — for every trader based on three orthogonal axes: how many days of NAV snapshots have been collected, how many distinct exchange venues are linked, and how continuous the snapshot record is. The tier is a measure of evidence weight, not of returns.
| Tier | Verified days | Linked venues | Max gap (days) |
|---|---|---|---|
| GOLD | >= 180 | >= 3 | <= 7 |
| SILVER | >= 90 OR linked venues >= 2 | >= 1 | (none) |
| BRONZE | >= 1 (and below SILVER) | >= 1 | (none) |
| UNVERIFIED | 0 | 0 | (none) |
Gold tier requires 180 verified days (roughly 6 months), 3 linked venues, and no snapshot gap exceeding 7 days. Silver requires either 90 verified days or 2 linked venues. Bronze is the entry tier — any trader with at least one snapshot but below Silver thresholds. Unverified means the account has been registered but no snapshots have yet been collected.
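The thresholds in the table can be expressed as a small decision function. The following Python sketch is an illustration of the rules as described here, not the code in lib/verification-depth.ts, which is not shown in this article and may differ; the names `DepthInputs` and `depth_tier` are hypothetical, and treating an account with snapshots but no linked venue as unverified is an assumption, since the table does not cover that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepthInputs:
    verified_days: int   # days with a collected NAV snapshot
    linked_venues: int   # distinct exchange venues linked
    max_gap_days: int    # longest gap between consecutive snapshots


def depth_tier(inputs: DepthInputs) -> str:
    """Map the three evidence axes to a tier label; returns are never consulted."""
    # Assumption: no snapshots or no linked venue means unverified.
    if inputs.verified_days < 1 or inputs.linked_venues < 1:
        return "UNVERIFIED"
    # Gold needs all three conditions at once.
    if (
        inputs.verified_days >= 180
        and inputs.linked_venues >= 3
        and inputs.max_gap_days <= 7
    ):
        return "GOLD"
    # Silver needs either enough days or a second venue.
    if inputs.verified_days >= 90 or inputs.linked_venues >= 2:
        return "SILVER"
    return "BRONZE"


if __name__ == "__main__":
    print(depth_tier(DepthInputs(verified_days=200, linked_venues=3, max_gap_days=5)))
    print(depth_tier(DepthInputs(verified_days=200, linked_venues=3, max_gap_days=10)))
```

Note that a trader who meets the Gold day and venue counts but has a gap longer than 7 days falls back to Silver in this sketch, because the Silver conditions are still satisfied.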
Why those specific thresholds
180 days is the shortest window at which the sample-size table above becomes relevant to any realistic Sharpe ratio. A Sharpe of 2 needs roughly two to three years of daily data for confident detection at p < 0.05; 180 calendar days (~125 trading days) is well short of that, but it is the threshold at which a strong Sharpe begins to clear the noise floor. Below 180 days the published returns are descriptive of a sample, not yet of a strategy.
The 3-venue requirement is independent of statistical sample size. It is a robustness condition: if a trader's reported alpha is a venue-specific artifact (an exchange's quirk, an API misclassification, a one-off rebate), connecting three venues forces the alpha to either show up consistently across them or be flagged as venue-specific. It is not a measure of more trading; it is a measure of having multiple redundant sources of evidence.
The 7-day gap ceiling enforces continuity. A trader who repeatedly disconnects their API for weeks at a time can selectively show only their good periods — a soft form of survivorship bias at the personal level. Capping the maximum gap at one week makes that practice impossible without losing the Gold tier.
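The two continuity inputs, verified days and maximum gap, can be derived from a list of snapshot dates. The Python sketch below assumes a gap is measured as elapsed calendar days between consecutive snapshot dates, so daily snapshots have a gap of 1; how NakedPnL counts gaps internally is not stated here, and the function name `verified_days_and_max_gap` is hypothetical.

```python
from datetime import date
from typing import Iterable, Tuple


def verified_days_and_max_gap(snapshot_dates: Iterable[date]) -> Tuple[int, int]:
    """Return (distinct snapshot days, longest gap in calendar days)."""
    days = sorted(set(snapshot_dates))  # several snapshots on one day count once
    if len(days) < 2:
        return len(days), 0
    max_gap = max((later - earlier).days for earlier, later in zip(days, days[1:]))
    return len(days), max_gap


if __name__ == "__main__":
    dates = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 10)]
    count, gap = verified_days_and_max_gap(dates)
    print(f"verified days: {count}, max gap: {gap} days")
```

Under this definition, a record that goes silent from 2 January to 10 January has an 8-day gap and would miss the 7-day ceiling.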
What the tiers do not measure
- Returns. A Gold-tier trader can have terrible returns. A Bronze-tier trader can have great returns. The tier reflects evidence depth, not outcomes.
- Strategy quality. The depth tier does not evaluate whether the underlying approach is sensible, sustainable, or replicable.
- Risk-adjusted performance. Sharpe, Sortino, and similar ratios are not used to determine the tier, and under NakedPnL's strip-back analytics policy they cannot be displayed publicly either.
- Future performance. A long, clean past is statistical evidence that the published numbers are not noise. It is not a forecast.
How depth interacts with chain integrity
Each NAV snapshot that contributes to verified-day count is also an entry in the trader's hash chain (lib/calculation/audit-hash.ts). The chain is append-only, so days that have been counted toward a depth tier cannot later be removed without breaking the SHA-256 chain hash and the Bitcoin OpenTimestamps anchor. A Gold trader cannot quietly trim their bad weeks out of the underlying record.
This is the structural reason the depth tier is meaningful at all. Without an append-only chain, a trader could pad their verified day count with a long string of recent good days while privately deleting earlier bad ones. The append-only constraint forces the day count to be a property of the genuine historical record.
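The append-only property can be illustrated with a generic SHA-256 hash chain. The sketch below is not the implementation in lib/calculation/audit-hash.ts; the snapshot fields, the JSON serialization, and the all-zero genesis value are assumptions made for illustration, and the OpenTimestamps anchoring step is not modelled.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # assumed starting value for the first link


def entry_hash(prev_hash: str, snapshot: Dict[str, str]) -> str:
    """Hash one snapshot together with the previous link's hash."""
    payload = json.dumps(snapshot, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def build_chain(snapshots: List[Dict[str, str]]) -> List[str]:
    """Return the running chain hash after each snapshot."""
    hashes, prev = [], GENESIS
    for snapshot in snapshots:
        prev = entry_hash(prev, snapshot)
        hashes.append(prev)
    return hashes


def verify_chain(snapshots: List[Dict[str, str]], hashes: List[str]) -> bool:
    """Recompute the chain and compare it with the published hashes."""
    return len(snapshots) == len(hashes) and build_chain(snapshots) == hashes


if __name__ == "__main__":
    snaps = [
        {"date": "2024-01-01", "nav": "100.00"},
        {"date": "2024-01-02", "nav": "90.00"},   # a bad day
        {"date": "2024-01-03", "nav": "110.00"},
    ]
    hashes = build_chain(snaps)
    print(verify_chain(snaps, hashes))
    # Dropping the bad day, even along with its own hash, breaks every later link.
    print(verify_chain([snaps[0], snaps[2]], [hashes[0], hashes[2]]))
```

Because each link commits to everything before it, removing an earlier day changes every subsequent hash, including the latest one that an external timestamp would have anchored.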
How to read a track record critically
When evaluating any published track record — on NakedPnL or anywhere else — three questions are decisive:
- How long is the record? Less than a year is preliminary; one to three years is suggestive; three years or more begins to be statistically meaningful for typical Sharpes.
- How continuous is the record? Frequent gaps suggest selective publication. A continuous chain with no unexplained absences is much stronger evidence.
- Is the record append-only? Can the trader silently revise or remove entries after the fact? An independently re-verifiable hash chain anchored to a public timestamping system rules this out.
NakedPnL's verification depth tier surfaces all three properties in a single label. A Gold tier means the trader has cleared the minimum duration threshold, has multiple independent venues of evidence, and has no gap of more than a week in the record. It is the minimum bar at which a published track record is worth evaluating in detail.
References
- Lo, A. (2002). The Statistics of Sharpe Ratios. Financial Analysts Journal, Vol. 58, No. 4.
- Bailey, D., Lopez de Prado, M. (2012). The Sharpe Ratio Efficient Frontier. Journal of Risk.
- GIPS Standards for Firms (2020). Composite Construction Requirements.
- Harvey, C., Liu, Y., Zhu, H. (2016). … and the Cross-Section of Expected Returns. Review of Financial Studies.