Methodology guide

Verification Depth — How Long a Track Record Has to Be to Mean Something

30 days of trading is statistical noise. Here is the math behind sample-size requirements and how NakedPnL's Bronze, Silver, and Gold tiers are calibrated.

By NakedPnL Research · May 7, 2026 · 13 min read

TL;DR

A 30-day track record cannot statistically distinguish a skilled trader from a lucky one with any meaningful confidence.
Lo (2002) shows the standard error of an annualized Sharpe ratio scales as 1/sqrt(T) — short windows have wide confidence intervals.
Even a strong Sharpe of 1.5 needs three years or more of daily data to be confirmed at 95% confidence; weaker Sharpes need far longer.
NakedPnL's Bronze/Silver/Gold tiers map to verified days, linked venues, and continuity — not to performance.
The depth tier is descriptive of evidence weight, never of trader quality. A Gold tier with poor returns is still poor returns.

Length of track record is the single most undervalued piece of context in published trading performance. A trader with three months of impressive returns and a trader with three years of steadier returns are routinely placed side by side in the same ranking, as if the numbers were equally meaningful. They are not. Because interval width scales as 1/sqrt(T), the statistical confidence interval around the shorter trader's return is roughly three and a half times wider (sqrt(12) is about 3.5), which usually means the published number is consistent with anything from genuine skill to pure luck.

This article explains why sample-size matters in performance measurement, what the academic literature says about minimum track-record duration, and how NakedPnL's verification depth tiers (Bronze, Silver, Gold) operationalize those requirements.

Why short tracks lie

Trading returns are noisy. Even a trader with zero edge will produce winning weeks and losing weeks at random; over short horizons, the noise dwarfs any underlying skill signal. The statistical question is whether the observed performance is large enough relative to the noise to credibly demonstrate that the trader has skill, as opposed to having been on the right side of a coin flip.

The standard tool for this is the t-statistic of the Sharpe ratio. Lo (2002), in 'The Statistics of Sharpe Ratios', derived the asymptotic distribution of the sample Sharpe and showed that for a true Sharpe S over T years of measurement (with daily observations), the standard error of the estimate is approximately:

SE(Sharpe_hat) ~= sqrt((1 + 0.5 * S^2) / T)

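where T is the number of years observed and S is the (unknown) true annualized Sharpe. Lo states the result per observation period; applying it to annualized figures with T in years overstates the S^2 term for daily data, so the estimates that follow err on the conservative side.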

Lo (2002), Equation 14. Holds for IID returns; the formula generalizes for autocorrelated returns with a worse constant.
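As a rough illustration, the standard-error formula can be evaluated directly. The sketch below is not NakedPnL code: the function names are hypothetical, and the 252-trading-day year is an assumption made for the example.

```typescript
// Illustrative sketch only: function names are hypothetical, not NakedPnL's API.

/** Standard error of the Sharpe estimate per the formula above (IID returns). */
function sharpeStandardError(trueSharpe: number, years: number): number {
  if (years <= 0) throw new RangeError("years must be positive");
  return Math.sqrt((1 + 0.5 * trueSharpe ** 2) / years);
}

/** Half-width of the two-sided 95% confidence interval (z = 1.96). */
function ci95HalfWidth(trueSharpe: number, years: number): number {
  return 1.96 * sharpeStandardError(trueSharpe, years);
}

const TRADING_DAYS_PER_YEAR = 252; // assumption: equity-style trading calendar

// Thirty trading days versus three years, both for a true Sharpe of 1.
const thirtyDays = ci95HalfWidth(1, 30 / TRADING_DAYS_PER_YEAR);
const threeYears = ci95HalfWidth(1, 3);

console.log(`30 days: +/- ${thirtyDays.toFixed(2)}`);
console.log(`3 years: +/- ${threeYears.toFixed(2)}`);
```

For a true Sharpe of 1 this gives an interval of about +/- 7.0 after thirty trading days and about +/- 1.4 after three years, which is the 1/sqrt(T) scaling in concrete terms.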

The implication is that detecting a true Sharpe of S above zero with 95% confidence requires a track record long enough that 1.96 × SE(Sharpe) is less than S. Solving the inequality gives a minimum sample size:

T_min ~= 1.96^2 * (1 + 0.5 * S^2) / S^2

For S = 0.5:  T_min ~= 17.3 years
For S = 1.0:  T_min ~= 5.8 years
For S = 1.5:  T_min ~= 3.6 years
For S = 2.0:  T_min ~= 2.9 years
For S = 3.0:  T_min ~= 2.3 years

Approximate years required to distinguish a given true Sharpe from zero at 95% confidence with daily IID returns.

The headline number: confirming an extraordinary Sharpe of 2 (the kind of risk-adjusted performance institutional allocators write headlines about) takes nearly three years of clean daily data at standard significance levels. Confirming a more typical good-but-not-spectacular Sharpe of 1 takes nearly six years. A 30-day window (T of about 0.12 years) has a standard error of roughly 2.9 even if the S^2 term is ignored, so it cannot reliably detect anything short of a Sharpe well above 5, which essentially does not occur in real trading data.

The trade-off between length and recency

Longer is better statistically, but longer also means older — and trading regimes change. The 2017 ICO bull market, the 2018 winter, the 2020 pandemic-driven volatility, the 2021 leverage cascade, and the 2022 Luna implosion all produced very different return distributions. A 10-year Sharpe is a precise estimate of an average that no longer reflects current conditions.

There is no universal answer to the trade-off. The mainstream institutional convention is that 3–5 years of clean monthly data is the minimum viable track record for a quantitative manager, with daily data offering meaningfully better statistical power per calendar year. For crypto specifically, the rapid regime shifts mean that even 5-year records often span structurally different markets, so the within-regime sample is often smaller than the calendar duration suggests.

Sample size requirements by Sharpe ratio

True Sharpe (S)	Years for p < 0.05	Years for p < 0.01	Trading days at 252/yr (p < 0.05)
0.25	63	109	15,970
0.50	17	30	4,360
0.75	8.8	15	2,210
1.00	5.8	10	1,450
1.50	3.6	6.3	910
2.00	2.9	5.0	730
2.50	2.5	4.4	640
3.00	2.3	4.1	590

Approximate sample size required to reject the null hypothesis Sharpe = 0 in favor of the indicated true Sharpe, under Lo (2002) IID assumptions.
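The figures in this table follow mechanically from the T_min expression. A minimal sketch, assuming two-sided critical values of 1.96 and 2.576 and a 252-day year; the function and constant names are hypothetical, not part of NakedPnL's codebase:

```typescript
// Illustrative sketch only: names are hypothetical, not NakedPnL's API.

/**
 * Minimum track record (in years) for z * SE(S) < S under the formula above,
 * i.e. T_min = z^2 * (1 + 0.5 * S^2) / S^2.
 */
function minYearsToDetect(trueSharpe: number, z: number = 1.96): number {
  if (trueSharpe <= 0) throw new RangeError("trueSharpe must be positive");
  return (z * z * (1 + 0.5 * trueSharpe ** 2)) / trueSharpe ** 2;
}

const Z_95 = 1.96;  // two-sided p < 0.05
const Z_99 = 2.576; // two-sided p < 0.01
const DAYS_PER_YEAR = 252; // assumption, matching the table header

for (const s of [0.5, 1.0, 2.0, 3.0]) {
  const years95 = minYearsToDetect(s, Z_95);
  const years99 = minYearsToDetect(s, Z_99);
  const days95 = Math.round(years95 * DAYS_PER_YEAR);
  console.log(
    `S=${s.toFixed(2)}  p<0.05: ${years95.toFixed(1)}y  p<0.01: ${years99.toFixed(1)}y  days: ${days95}`
  );
}
```

Because the 0.5 * S^2 term does not shrink as S grows, T_min under this formula never falls below 0.5 * z^2, which is about 1.9 years at z = 1.96.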

The table illustrates two important properties. First, the cost of detecting modest skill (Sharpe below 1) is enormous: it takes somewhere between a decade and several decades of data, which for the weakest edges is longer than a working career. Second, even very strong Sharpes (above 2) require more than two years of clean daily data, which is far longer than the typical influencer track record on display.

What track record length does and does not prove

Length of record is necessary but not sufficient. A 5-year clean track record gives the statistical power to distinguish skill from noise; it does not by itself prove the trader is skilled. The same long sample could in principle have been produced by a long run of luck, by survivorship from a much larger cohort, or by a strategy that worked in the historical regime but is currently broken.

What length proves is evidentiary weight: a longer sample means the measured Sharpe and time-weighted return (TWR) have narrower confidence intervals. It is a multiplier on whatever signal exists, not a substitute for the signal. A trader with 5 years of mediocre performance is mediocre with high confidence. A trader with 5 years of strong performance is strong with high confidence. A trader with 30 days of strong performance is unknown.

How NakedPnL's depth tiers are calibrated

NakedPnL's verification depth (lib/verification-depth.ts) computes a tier — Bronze, Silver, or Gold — for every trader based on three orthogonal axes: how many days of NAV snapshots have been collected, how many distinct exchange venues are linked, and how continuous the snapshot record is. The tier is a measure of evidence weight, not of returns.

Tier	Verified days	Linked venues	Max gap (days)
GOLD	>= 180	>= 3	<= 7
SILVER	>= 90 OR linked venues >= 2	>= 1	(none)
BRONZE	>= 1 (and below SILVER)	>= 1	(none)
UNVERIFIED	0	0	(none)

Depth tier thresholds in lib/verification-depth.ts. GOLD requires all three of its conditions; a trader who fails any GOLD condition is at most SILVER. SILVER requires either of its two conditions.

Gold tier requires 180 verified days (roughly 6 months), 3 linked venues, and no snapshot gap exceeding 7 days. Silver requires either 90 verified days or 2 linked venues. Bronze is the entry tier — any trader with at least one snapshot but below Silver thresholds. Unverified means the account has been registered but no snapshots have yet been collected.
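The rules above can be sketched as a small decision function. This is a hypothetical reconstruction from the table, not the contents of lib/verification-depth.ts; the type, field, and function names are illustrative, and treating a record with no venue as unverified is an assumption.

```typescript
// Hypothetical sketch of the rules in the table above. This is NOT the
// contents of lib/verification-depth.ts; all names here are illustrative.

type DepthTier = "GOLD" | "SILVER" | "BRONZE" | "UNVERIFIED";

interface DepthInput {
  verifiedDays: number; // days with at least one NAV snapshot
  linkedVenues: number; // distinct exchange venues connected
  maxGapDays: number;   // longest gap between consecutive snapshots
}

function computeDepthTier(input: DepthInput): DepthTier {
  const { verifiedDays, linkedVenues, maxGapDays } = input;

  // Assumption: no snapshots (or no venue) means nothing is verified yet.
  if (verifiedDays < 1 || linkedVenues < 1) return "UNVERIFIED";

  // GOLD: all three conditions must hold (AND).
  if (verifiedDays >= 180 && linkedVenues >= 3 && maxGapDays <= 7) return "GOLD";

  // SILVER: either condition is enough (OR); no gap ceiling applies.
  if (verifiedDays >= 90 || linkedVenues >= 2) return "SILVER";

  // BRONZE: at least one snapshot, below the SILVER thresholds.
  return "BRONZE";
}

// Four venues but only 60 days: the day count blocks GOLD.
console.log(computeDepthTier({ verifiedDays: 60, linkedVenues: 4, maxGapDays: 2 }));
// Long record and three venues, but a 30-day gap: the gap blocks GOLD.
console.log(computeDepthTier({ verifiedDays: 400, linkedVenues: 3, maxGapDays: 30 }));
```

Both example calls return SILVER. Checking GOLD first and falling through keeps the AND and OR semantics visible in one place.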

Why those specific thresholds

180 days is a pragmatic floor rather than a point of statistical confirmation. Under the formula above, a Sharpe of 2 needs about 730 trading days (2.9 years) for confident detection at p < 0.05; 180 calendar days (~125 trading days) is well short of that, but it is the threshold at which a strong Sharpe begins to clear the noise floor. Below 180 the published returns are descriptive of a sample, not yet of a strategy.

The 3-venue requirement is independent of statistical sample size. It is a robustness condition: if a trader's reported alpha is a venue-specific artifact (an exchange's quirk, an API misclassification, a one-off rebate), connecting three venues forces the alpha to either show up consistently across them or be flagged as venue-specific. It is not a measure of more trading; it is a measure of having multiple redundant sources of evidence.

The 7-day gap ceiling enforces continuity. A trader who repeatedly disconnects their API for weeks at a time can selectively show only their good periods — a soft form of survivorship bias at the personal level. Capping the maximum gap at one week makes that practice impossible without losing the Gold tier.
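One way the continuity check could be computed is sketched below. The function name, the "YYYY-MM-DD" input format, and the choice to measure a gap as the elapsed days between consecutive snapshots (so daily snapshots give a gap of 1) are all assumptions for illustration, not NakedPnL's actual definition.

```typescript
// Hypothetical sketch: how a maximum snapshot gap could be measured.
// Assumes snapshot dates are UTC calendar days in "YYYY-MM-DD" form.

const MS_PER_DAY = 24 * 60 * 60 * 1000;

/** Largest number of days between consecutive snapshot dates (0 if fewer than two). */
function maxGapDays(snapshotDates: string[]): number {
  // Convert each date to a whole-day count since the epoch, then sort ascending.
  const days = snapshotDates
    .map((d) => Date.parse(`${d}T00:00:00Z`) / MS_PER_DAY)
    .sort((a, b) => a - b);

  let maxGap = 0;
  for (let i = 1; i < days.length; i++) {
    maxGap = Math.max(maxGap, days[i] - days[i - 1]);
  }
  return maxGap;
}

const gap = maxGapDays(["2026-01-01", "2026-01-02", "2026-01-12", "2026-01-13"]);
console.log(gap, gap <= 7 ? "within ceiling" : "exceeds 7-day ceiling");
```

Here the jump from 2 January to 12 January is a 10-day gap, so this record would miss the Gold ceiling however long it runs afterwards.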

What the tiers do not measure

Returns. A Gold-tier trader can have terrible returns. A Bronze-tier trader can have great returns. The tier reflects evidence depth, not outcomes.
Strategy quality. The depth tier does not evaluate whether the underlying approach is sensible, sustainable, or replicable.
Risk-adjusted performance. Sharpe, Sortino, and similar ratios are not used to determine the tier, and under NakedPnL's strip-back analytics policy they are not displayed publicly either.
Future performance. A long, clean past is statistical evidence that the published numbers are not noise. It is not a forecast.

How depth interacts with chain integrity

Each NAV snapshot that contributes to verified-day count is also an entry in the trader's hash chain (lib/calculation/audit-hash.ts). The chain is append-only, so days that have been counted toward a depth tier cannot later be removed without breaking the SHA-256 chain hash and the Bitcoin OpenTimestamps anchor. A Gold trader cannot quietly trim their bad weeks out of the underlying record.

This is the structural reason the depth tier is meaningful at all. Without an append-only chain, a trader could pad their verified day count with a long string of recent good days while privately deleting earlier bad ones. The append-only constraint forces the day count to be a property of the genuine historical record.
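The append-only property can be illustrated with a toy hash chain. This is a minimal sketch under stated assumptions: the entry layout and the hashing recipe are invented for the example and are not the format used in lib/calculation/audit-hash.ts, and the timestamp anchoring step is left out entirely. It uses Node's built-in crypto module.

```typescript
import { createHash } from "node:crypto";

// Hypothetical sketch of an append-only hash chain. The entry layout and the
// hashing recipe are assumptions, not the format used in
// lib/calculation/audit-hash.ts, and timestamp anchoring is omitted.

interface ChainEntry {
  date: string;     // snapshot day, e.g. "2026-01-01"
  nav: string;      // NAV as a decimal string to avoid float formatting drift
  prevHash: string; // hash of the previous entry ("" for the first)
  hash: string;     // SHA-256 over prevHash, date and nav
}

function entryHash(prevHash: string, date: string, nav: string): string {
  return createHash("sha256").update(`${prevHash}|${date}|${nav}`).digest("hex");
}

function appendSnapshot(chain: ChainEntry[], date: string, nav: string): ChainEntry[] {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : "";
  return [...chain, { date, nav, prevHash, hash: entryHash(prevHash, date, nav) }];
}

/** True only if every entry links to its predecessor and its hash recomputes. */
function verifyChain(chain: ChainEntry[]): boolean {
  return chain.every((entry, i) => {
    const expectedPrev = i === 0 ? "" : chain[i - 1].hash;
    return entry.prevHash === expectedPrev &&
      entry.hash === entryHash(entry.prevHash, entry.date, entry.nav);
  });
}

let chain: ChainEntry[] = [];
chain = appendSnapshot(chain, "2026-01-01", "1000.00");
chain = appendSnapshot(chain, "2026-01-02", "940.00"); // a bad day
chain = appendSnapshot(chain, "2026-01-03", "1010.00");

// Quietly dropping the bad day breaks the link from the third entry.
const trimmed = [chain[0], chain[2]];
console.log(verifyChain(chain), verifyChain(trimmed)); // true false
```

Removing or editing any earlier entry changes the hash its successor was built on, so the tampered chain fails verification even though each surviving entry is individually intact.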

How to read a track record critically

When evaluating any published track record — on NakedPnL or anywhere else — three questions are decisive:

How long is the record? Less than a year is preliminary; one to three years is suggestive; three years or more begins to be statistically meaningful for strong Sharpes, while typical Sharpes near 1 need closer to six.
How continuous is the record? Frequent gaps suggest selective publication. A continuous chain with no unexplained absences is much stronger evidence.
Is the record append-only? Can the trader silently revise or remove entries after the fact? An independently re-verifiable hash chain anchored to a public timestamping system rules this out.

NakedPnL's verification depth tier surfaces all three properties in a single label. A Gold tier means the trader has cleared the basic statistical threshold, has multiple independent venues of evidence, and has not skipped weeks of unfavourable history. It is the minimum bar at which a published track record is worth evaluating in detail.

Frequently asked questions

Why isn't 30 days enough to evaluate a trader?

Because the standard error of the Sharpe ratio scales as 1/sqrt(T). With only 30 trading days (T of about 0.12 years), the 95% confidence interval around an observed Sharpe of even 3 is wider than +/- 5, so it comfortably contains zero. For more typical Sharpe values around 1 the interval is several times larger than the estimate itself. In both cases the data cannot reject the hypothesis that the trader has no skill at all.

What is the minimum track record NakedPnL would consider Gold tier?

180 verified days, 3 linked exchange venues, and no snapshot gap exceeding 7 days. Those are AND conditions: a trader missing any one of them is at most Silver. The 180-day threshold corresponds to roughly 6 months of daily NAV data, which is the point at which the statistical noise floor begins to clear for strong Sharpe ratios.

Can a trader get to Gold by adding venues without trading more?

No. The 3-venue requirement is necessary but not sufficient. The 180-day requirement applies independently. A trader with 4 connected venues and 60 days of snapshots is Silver, not Gold, because the day count fails the threshold. The system requires evidence on all three axes.

Why isn't the Sharpe ratio used to determine the tier?

Because Sharpe ratio is a performance metric, and the depth tier is an evidence metric. NakedPnL's strip-back analytics policy specifically does not display Sharpe, Sortino, Calmar, or other risk-adjusted measures publicly — those are calculated for the admin layer only. The user-facing depth tier reflects how much evidence has been collected, not how good the trader is. A bad trader with a long clean record is Gold tier; the tier guarantees the badness is well-measured, not that there is no badness.

What happens if a trader's snapshot record has a gap exceeding 7 days?

They cannot achieve Gold tier, because the 7-day ceiling applies to the largest gap anywhere in their record. New continuous data does not erase old gaps; those gaps are permanently part of the chain. A trader who had a 30-day gap two years ago cannot reach Gold even if everything since has been continuous. This is intentional: it prevents traders from quietly skipping bad periods early in their tenure and then earning Gold later.

Are the Lo (2002) sample-size formulas exact?

They are asymptotic approximations. For very short windows (T < 50 observations) the actual distribution of the sample Sharpe departs measurably from the normal approximation, and exact bounds require simulation or the corrected formulas in Lo's paper. For the timeframes relevant to NakedPnL's depth tiers (90+ days), the asymptotic formulas are accurate to within a few percent of the true required sample size, which is sufficient for tier calibration.

Why include linked venues in the depth tier at all?

Because reported alpha can be venue-specific. A trader who looks profitable on Binance but never connects Bybit might be exploiting a quirk, a rebate, or a misclassified API field. Forcing 3 venues for Gold tier requires the apparent skill to either reproduce across exchanges or be flagged as venue-dependent. It is a robustness check on the underlying signal, not a measure of trade volume.
