— Methodology · v0.1.4 · last revised 2026-04-12 —

How we do this poorly, on purpose.

A guide to what Standard Poorly measures, what those measurements mean, and the long list of reasons you should not bet your house on any of it.

TL;DR · The 5 things that matter

What we measure: the z-score of the day-over-day change in a publicly observable signal, against a 180-day baseline. When |z| ≥ 2.5, we call it a “fire” and check what SPY did at 1d / 1wk / 1mo / 3mo / 6mo / 1yr.
Why 6 horizons: so we can’t cherry-pick. A signal that works at 1d but breaks at 3mo is the most honest thing on this site — and we show both.
The 1–10 grade: percentile within our universe (not absolute). 10/10 = top decile by composite score. Floors set the p-value, sample-size, and OOS requirements an indicator must meet to reach the higher grades.
The multiple-testing problem: if you test 658 indicators against SPY at α=0.05, ~33 will look “significant” by pure chance. We publish them anyway, in their own section, and tell you which ones.
Walk-forward OOS: every indicator that earns a “validated” tag did so on data it had never seen during the discovery window. Failed-OOS indicators stay published, marked “failed” and docked 25 points on the composite score, so you can see what we tried and what didn’t survive.

§01

How this works

For each indicator we publish, we compute the day-over-day change in a publicly observable signal — an App Store rank, a Google Trends index, a weather station reading, the closing price of a meme stock. We z-score that change against a 180-day rolling window. When the absolute z-score crosses 2.5, we mark the day as a “fire.” (A separate, looser live pulse — |z| ≥ 2.0 against a 30-day window — powers the real-time watchlist; §04 explains why both exist and how to tell them apart.)
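The fire rule above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the site's engine: the function name `fire_days` is hypothetical, the change is taken as a plain difference (not a percent change), and the baseline is assumed to be the `window` changes before today, with today excluded. The text does not settle either of those two choices.

```python
from statistics import mean, stdev

def fire_days(values, window=180, threshold=2.5):
    """Return indices into `values` of days whose day-over-day change is a fire.

    Assumptions (not confirmed by the methodology text): changes are plain
    differences, and the baseline for a day is the `window` changes before it.
    """
    # changes[i] is the move from day i to day i + 1.
    changes = [b - a for a, b in zip(values, values[1:])]
    fires = []
    for t in range(window, len(changes)):
        baseline = changes[t - window:t]
        sd = stdev(baseline)
        if sd == 0:
            continue  # flat baseline: the z-score is undefined, so skip the day
        z = (changes[t] - mean(baseline)) / sd
        if abs(z) >= threshold:
            fires.append(t + 1)  # the day that moved, as an index into `values`
    return fires
```

Under the same assumptions, calling `fire_days(values, window=30, threshold=2.0)` would give the looser live pulse.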

We then check what happened to SPY at six forward horizons (1, 5, 21, 63, 126, and 252 trading days) and report two numbers per horizon: the hit rate (the fraction of fires followed by a positive forward return) and the p-value for that hit rate against a null of 50%.
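As a sketch of the two numbers reported per horizon: the text does not name the test behind the p-value, so the code below assumes an exact two-sided binomial test against 50%. The function names are hypothetical, and forward returns are assumed to be simple returns on a list of SPY closes.

```python
from math import comb

HORIZONS = (1, 5, 21, 63, 126, 252)  # trading days, as listed in the text

def forward_returns(prices, fire_idx, horizon):
    """Simple forward returns from each fire day that has enough data ahead of it."""
    return [prices[i + horizon] / prices[i] - 1.0
            for i in fire_idx if i + horizon < len(prices)]

def hit_rate_and_p(returns):
    """Hit rate (share of positive returns) and exact two-sided binomial p vs 50%."""
    n = len(returns)
    if n == 0:
        return None, None
    hits = sum(r > 0 for r in returns)
    # Under a 50% null the distribution is symmetric, so the two-sided p-value
    # is the probability of landing at least as far from n/2 as `hits` did.
    dist = abs(hits - n / 2)
    p = sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= dist) / 2 ** n
    return hits / n, p
```

Running `hit_rate_and_p(forward_returns(prices, fires, h))` for each `h` in `HORIZONS` would produce one row of a scorecard.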

The horizons are not opinions. We picked them because they correspond to “tomorrow,” “this week,” “this month,” “this quarter,” “six months,” and “a year,” and because reporting only one of them would let us cherry-pick. Showing all six lets you see the cases, and they are quite common, where a signal works at one horizon and falls apart at the others. Those cases are the most honest thing about this website.

§02

The 1–10 signal grade

Every indicator on the site carries a single number: 1 to 10. It exists because readers ask one question more than any other: “the F&G card says 73%, the Wikipedia card says 66%; which should I trust more?” The grade collapses the five-axis “is this real?” problem into one comparable score.

The grade is percentile within our universe, not an absolute rating. A 10/10 means “top 10% of all indicators we publish, ranked by composite score.” A 5/10 is around the median. A 1/10 is the bottom decile. This is self-calibrating: as we add more indicators, the distribution rebalances so a 10 always means top decile, not a number that drifts over time.

The composite score is built from five weighted axes. Each is capped so no single factor can dominate:

Effect size (cap +30): the average forward return on fire days, signed by signal direction. A 3% average earns the full bonus.
Frequency edge (cap +15): how far the directional hit-rate sits from a 50% coin-flip. A 65% hit rate earns the cap.
Sample size (cap +15): log-scaled. n=100 earns +10 points; n=1,000 earns the +15 cap.
Validation status: +25 points for out-of-sample-validated, +10 for in-sample-only, −25 for failed-OOS.
P-value: +15 if p<.01, +10 if p<.05, 0 if p<.10, −10 if p≥.10.
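The five axes can be added up as in the sketch below. The caps, anchor points, and point values come from the list above; everything between the anchors is an assumption. In particular, the linear ramps for effect size and frequency edge, the zero floor on effect size, and the `5 * log10(n)` form (chosen because it reproduces both stated sample-size anchors) are guesses, and `composite_score` is a hypothetical name.

```python
from math import log10

def composite_score(avg_return, hit_rate, n, validation, p_value):
    """Sum the five axes; ramps between the stated anchor points are assumed linear."""
    # Effect size: a 3% direction-signed average return earns the +30 cap.
    effect = 30.0 * max(0.0, min(1.0, avg_return / 0.03))
    # Frequency edge: distance from a 50% coin flip; 65% earns the +15 cap.
    freq = 15.0 * min(1.0, abs(hit_rate - 0.5) / 0.15)
    # Sample size: 5 * log10(n) gives +10 at n=100 and hits the +15 cap at n=1,000.
    size = min(15.0, 5.0 * log10(n)) if n >= 1 else 0.0
    # Validation status points, as listed.
    status = {"oos_validated": 25, "in_sample_only": 10, "failed_oos": -25}[validation]
    # P-value buckets, as listed.
    if p_value < 0.01:
        pv = 15
    elif p_value < 0.05:
        pv = 10
    elif p_value < 0.10:
        pv = 0
    else:
        pv = -10
    return effect + freq + size + status + pv
```

With these assumptions the score runs from −35 to +100; only its rank within the universe feeds the grade.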

We also enforce floors: hard evidence bars an indicator must clear to reach the upper grades. Without floors, adding 400 weak indicators would lift previously MODERATE ones to STRONG by sorting alone. Floors fix this:

Grade ≥ 9 requires p<.01 AND n≥500.
Grade ≥ 7 requires out-of-sample validation.
Grade ≥ 6 requires n≥100.
Grade ≥ 5 requires not-failed OOS.

Tier labels map onto the 1–10 scale: STRONG (8–10), MODERATE (6–7), WEAK (4–5), NOISE (1–3), and TOO NEW (pending — not yet enough data to score).
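One way to read the percentile grade, the floors, and the tier labels together is sketched below. It assumes the floors act as caps: an indicator that misses a floor is held at the highest grade below it. The text states the requirements but not the mechanics, so the capping logic and all three function names are hypothetical.

```python
def percentile_grade(score, universe_scores):
    """Decile rank of `score` within the universe: 1 (bottom) to 10 (top)."""
    total = len(universe_scores)
    at_or_below = sum(s <= score for s in universe_scores)
    # Integer ceiling of 10 * at_or_below / total, avoiding float rounding.
    return max(1, (10 * at_or_below + total - 1) // total)

def apply_floors(grade, p_value, n, validation):
    """Cap the grade just below any floor the indicator fails to clear."""
    if validation == "failed_oos":
        grade = min(grade, 4)  # grade >= 5 requires not-failed OOS
    if n < 100:
        grade = min(grade, 5)  # grade >= 6 requires n >= 100
    if validation != "oos_validated":
        grade = min(grade, 6)  # grade >= 7 requires out-of-sample validation
    if not (p_value < 0.01 and n >= 500):
        grade = min(grade, 8)  # grade >= 9 requires p < .01 AND n >= 500
    return grade

def tier(grade):
    """Map a grade to its tier label; None stands for an unscored indicator."""
    if grade is None:
        return "TOO NEW"
    if grade >= 8:
        return "STRONG"
    if grade >= 6:
        return "MODERATE"
    if grade >= 4:
        return "WEAK"
    return "NOISE"
```

For example, an in-sample-only indicator in the top decile would come out as `apply_floors(10, 0.001, 1000, "in_sample_only")`, which is 6, or MODERATE.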

Honest caveat: the weight choices and floor cutoffs are editorial judgments, not derived from a back-test. We picked them to encode “OOS-validation matters most, statistical significance second, raw effect size third.” A future version will back-test these weights against actual signal performance.

§03

View modes (skin × lexicon)

Two independent toggles in the masthead let you pick how the site presents itself. They cross to give four total views:

Skin (icon ◧/▮ or keyboard T) — Almanac renders as a cool-stone broadsheet newspaper. Terminal renders as a Bloomberg-green CRT.
Lexicon (icon A/σ or keyboard L) — Plain uses everyday English (“same direction,” “STRONG signal”). Technical uses the underlying statistics jargon (“concordant,” “8/10 · OOS✓ · p<.01,” raw r/p/n readouts).

Skin and lexicon are orthogonal. Some readers want the newspaper look but technical copy. Some want the Bloomberg look with plain English. Your preferences persist across pages and sessions.

§04

Why σ=2.5, window=180d (and how we picked)

A “fire day” means the indicator's day-over-day change landed at least 2.5 standard deviations from its 180-day rolling average. We didn't pick those numbers because they're conventional; we picked them because we tested.

Two related-but-distinct definitions live on the site. The validated “fire” is the one above (|z| ≥ 2.5 vs a 180-day baseline), used by the historical engine that powers every indicator's OOS scorecard. A separate short-term “pulse” uses a looser threshold (|z| ≥ 2.0 vs a 30-day baseline) and powers the live /fires watchlist and the homepage cofiring banner. We use both intentionally: the validated fire is the audited record, the pulse is the live-engine read. Treat pulse hits as “worth watching today,” not as the OOS-validated signal the rest of this methodology describes.

Our AutoResearch loop runs deterministic parameter sweeps across multiple indicator categories, with held-out validation slices the optimizer never sees. Across five rounds of sweeps covering the vol panel, an extended 1990–2026 history, edge-case σ/window combinations, and cross-category robustness on ~190 indicators (chaos, macro, weather, technical), the robust interior optimum we landed on is σ=2.5 / window=180d.

The headline finding from the cross-category run: every category we tested prefers a 180-day window, and three of four prefer σ=2.5 (macro is slightly happier with a softer threshold). The previous default we inherited (σ=2.0 / window=60d) was beaten on every panel — the original search space was just too narrow to see the better corner of the surface.

We don't publish the per-round sweep grids, the corner-edge lift numbers, or the category-by-category validation deltas — those would hand a competitor the entire AutoResearch playbook. The consolidated findings (and the surface heatmaps) are part of the Analyst-tier research bundle.

The discipline that earned the change: AutoResearch runs deterministically in Python (zero LLM tokens during the loop), tests against held-out validation periods the optimizer never sees, and re-runs on a fixed cadence. If you find a better setting, the rubric is open — we'll re-run and switch when evidence justifies it.

§05

The multiple-testing problem

We are currently testing 658 indicators across 6 horizons — 3,948 hypotheses. At a conventional 5% significance threshold, we should expect ~197 of those to look “significant” by chance alone, even if every indicator were pure noise.

This is the problem that makes most quantitative blogs irresponsible. They run a thousand backtests, publish the seventeen that worked, and never mention the other 983. Standard Poorly publishes all 3,948. The site is, in a real sense, a controlled demonstration of why p-hacking works.

“Forgive me, for I have multiple-tested.”

— The scientist's prayer, more or less

If you want to use any of these signals seriously, the appropriate adjustment is Bonferroni: divide your significance threshold by the number of hypotheses you considered before picking one. Our published per-test p-values are not family-wise corrected — they are the raw numbers. We show them raw because correcting them in advance would hide what we are actually doing, which is testing a lot of things in public.
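The arithmetic in this section, worked through using only the numbers given in the text (the variable names are illustrative):

```python
indicators = 658
horizons = 6
alpha = 0.05

hypotheses = indicators * horizons            # 3,948 tests in the family
expected_false_positives = hypotheses * alpha  # about 197 if every indicator is noise
bonferroni_alpha = alpha / hypotheses          # per-test threshold after correction

print(f"hypotheses: {hypotheses}")
print(f"expected by chance at alpha={alpha}: {expected_false_positives:.1f}")
print(f"Bonferroni per-test threshold: {bonferroni_alpha:.2e}")
```

So a raw p-value on a card would need to fall below roughly 0.0000127 to survive a Bonferroni correction across the whole site, assuming you treat all 3,948 tests as one family.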

§06

Walk-forward out-of-sample

An in-sample p-value tells you the rule fit the data it was tuned on; it does not tell you the signal will continue. To check for the latter, we hold out the last 90 trading days of every series, fit the rule on the rest, and then evaluate the held-out window with the rule frozen. If the hit rate in the holdout is consistent with the in-sample rate, we mark the indicator OOS validated. If it isn't, we mark it failed and we keep showing it anyway.
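A sketch of the holdout check follows. The text does not say how “consistent with the in-sample rate” is decided, so this assumes an exact two-sided binomial test of the holdout hits against the in-sample hit rate at a 5% level. The test, the level, the status strings, and the function names are all assumptions for illustration.

```python
from math import comb

HOLDOUT_DAYS = 90  # last 90 trading days, per the text

def split_fires(fire_idx, n_days, holdout=HOLDOUT_DAYS):
    """Split fire-day indices into (in-sample, holdout) by position in the series."""
    cutoff = n_days - holdout
    return ([i for i in fire_idx if i < cutoff],
            [i for i in fire_idx if i >= cutoff])

def binom_two_sided(hits, n, p0):
    """Exact two-sided p-value: total probability of outcomes no likelier than `hits`."""
    pmf = [comb(n, k) * p0 ** k * (1 - p0) ** (n - k) for k in range(n + 1)]
    return sum(q for q in pmf if q <= pmf[hits] * (1 + 1e-9))

def oos_status(in_hits, in_n, out_hits, out_n, alpha=0.05):
    """'validated' if holdout hits look consistent with the in-sample rate."""
    if in_n == 0 or out_n == 0:
        return None  # nothing to compare
    p = binom_two_sided(out_hits, out_n, in_hits / in_n)
    return "validated" if p >= alpha else "failed"
```

A 90-day holdout often contains only a handful of fires, so a test like this has little power to reject; that is one reason the choice of consistency check matters.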

We do not run OOS continuously — only on a quarterly cadence, because too-frequent re-validation is its own kind of multiple testing. The badge on every card shows you when the last OOS test ran and whether the signal survived.

§07

Why we publish spurious correlations anyway

The honest answer is that the joke is the point. There is a real intellectual tradition — Tyler Vigen's Spurious Correlations, the stock-market-and-Super-Bowl people, the Hindenburg Omen — of finding statistical relationships that look meaningful and are not. Standard Poorly is the first publication to do this with full disclosure of the meta-statistics that make the genre nonsense.

So when you read on the home page that “BetterHelp downloads spiked 40%, historically a 72% hit-rate signal for SPY gains over the next five trading days,” the correct response is not “I should buy SPY.” It is “interesting; how many other tests is Standard Poorly running today, and is 72% even surprising given that?” The answer is in the disclaimer at the bottom of every card. We did the homework. You can audit it.

“Nobody asked. We ledger anyway.”

— House style

§08

Pricing & access

Four tiers. The full feature-by-feature matrix lives on the pricing page — this section is the one-line summary so you can place yourself before clicking through.

Almanac (free, forever): Read everything. Daily readings, full charts, every pundit's open positions and headline performance.

§09

Words we use

The lexicon toggle (§03) gives the site two voices: Plain (everyday English, our default) and Technical (statistics jargon, for readers who want the raw readout). Some terms get translated, some only show up on the technical side. Here's the translation table.

Plain voice (default)	Technical voice	What it actually measures
How unusual today is	z-score	How many standard deviations today’s value sits above or below the recent rolling average. |z|=2.5 means “rarer than 99% of recent days.”
How extreme it has to be to count	sigma threshold (σ)	The cutoff at which we say a value is “unusual enough to fire.” Currently 2.5σ.
What we compare today to	baseline window	How many trading days back we average to get the “normal” value. Currently 180.
How far ahead we look	forward horizon	How many trading days into the future we measure SPY’s move after a fire. We track 1, 5, 21, 63, 126, and 252 days.
How often this turns out right	hit rate	Of all the times this signal fired, the percentage where the market moved in the predicted direction.
How strong the call was	conviction (claim strength)	For pundit predictions only — a 0–100 score for how forcefully the speaker stated the view. Pundit Index pages explain it in detail.
Two signals firing around the same time	cofire / co-fire	When indicator A and indicator B both fire within a few trading days of each other.
When these both fire	stacked / joint signal	A pair where both indicators must fire together to count as a signal. The combined predictive power is often stronger than either alone.
Pointed one way before, the other way lately	sign flip / regime change	A signal that was bullish in older data but bearish recently (or vice versa). Could be a real shift in how markets relate, or just noise. We mark these so you can decide.
Could be luck	low confidence / underpowered	A signal with too few examples to distinguish real pattern from coincidence. Shown with caution.
Re-rolled the dice 1000 times to check	bootstrap confidence interval	A statistical robustness test. We resample the data 1000 times to see if the hit rate would still hold under different sample compositions.
We adjusted for the fact that we tested a lot of stuff	multiple-testing correction	When you run thousands of tests, some look significant by chance. Our published p-values are raw, not corrected; we track and disclose the total test count so you can adjust your priors. See §05.
Held up out of sample	OOS validated	The signal’s hit rate held up on data we kept hidden during model fitting. See §06.
Reliable pair	STABLE-BULL / STABLE-BEAR	A two-signal combination that works the same way both in our 2010–2019 records AND in held-out 2020+ data. The trustworthy ones.
Flipped recently	SIGN-FLIP	A pair that pointed one direction historically but the opposite recently. We show with a warning.
Could be luck	LOW-CONF	A pair with fewer than 20 examples in either time period. Pattern is suggestive but easily explained by chance.
Too new to judge	INSUFFICIENT	A pair with fewer than 8 fires in either period. We show it for completeness but tag it clearly.

Toggle between the Plain and Technical voices using the L key on any page, or the lexicon switcher in the top-right corner. The Plain voice prioritizes everyday language and storytelling. The Technical voice prioritizes brevity and the technical names that finance professionals expect.
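The “re-rolled the dice 1000 times” row in the table above describes a bootstrap. A minimal percentile-bootstrap sketch for a hit rate follows; the 95% level, the fixed seed, and the function name are assumptions, and `outcomes` is taken to be a list of 1s (hit) and 0s (miss), one per fire.

```python
import random

def bootstrap_hit_rate_ci(outcomes, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the hit rate of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes)
    # Resample n outcomes with replacement, n_resamples times, and sort the rates.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return rates[lo_idx], rates[hi_idx]
```

If the interval for a signal's hit rate comfortably includes 50%, the headline number is not telling you much.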

§10

What we will not do

We will not sell trading signals. We will not let sponsorship influence which indicators surface or how they rank — the algorithm is the algorithm, and it is published. We will not offer “premium” indicators that look better because we cherry-picked them; the surfaceable threshold is the same for every signal. We will not retroactively edit a published p-value when it embarrasses us; we will publish a correction with the original number visible. We will not sell your data. We will not, ever, charge for the disclaimer.

We will run clearly disclosed brokerage affiliate links and newsletter sponsorships. They are how a free public dashboard gets paid for. Every affiliate link is labeled in plain text. Every sponsored newsletter section is labeled Sponsored by [name]. Sponsorship buys exposure in clearly marked editorial slots; it does not buy ranking, surfacing, or methodology changes. If you ever spot a correlation that looks like it shouldn’t be on the front page, let us know. We want the audit.

§11

How to verify this yourself

Every night, after data publishes, a script walks every JSON file under web/public/data/, computes a SHA-256 hash of each one, and writes a dated manifest: {date, files: {path: sha256}, manifest_sha256_of_yesterday}. That last field holds the hash of the previous day’s manifest, so the manifests form a chain. Rewrite any historical data file and its hash no longer matches that day’s manifest; rewrite the manifest to match and the next day’s link to it breaks, and so on down the chain. Each manifest gets committed to our public GitHub repo, so git history is an independent timestamp witness: anyone can check out an old commit and confirm the manifest existed when we say it did.
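The manifest shape described above can be sketched as follows. This is an illustration, not the nightly script: the function names are hypothetical, and hashing a manifest as sorted-key JSON is an assumed convention (the real pipeline may hash the manifest file's bytes instead).

```python
import hashlib
import json
from pathlib import Path
from typing import Optional

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def manifest_hash(manifest: dict) -> str:
    """Hash a manifest via its sorted-key JSON form (an assumed canonical form)."""
    return sha256_hex(json.dumps(manifest, sort_keys=True).encode("utf-8"))

def build_manifest(data_dir: Path, date: str, previous: Optional[dict]) -> dict:
    """Hash every JSON file under `data_dir` and link to the previous manifest."""
    files = {
        p.relative_to(data_dir).as_posix(): sha256_hex(p.read_bytes())
        for p in sorted(data_dir.rglob("*.json"))
    }
    return {
        "date": date,
        "files": files,
        # None marks the first snapshot in the chain.
        "manifest_sha256_of_yesterday": manifest_hash(previous) if previous else None,
    }
```

Because each manifest embeds the hash of the one before it, changing any earlier manifest changes its hash and invalidates the link stored in its successor.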

Latest snapshot: 2026-07-29 · 1387 files hashed · manifest sha256 980df2e4c45a2c3fa8348e3ba8403ace3ff9fadb8f3b4be586f860f7091dfd7c · first snapshot in the chain

To re-verify any of it yourself: clone the repo, then run python scripts/verify_snapshot.py. It re-hashes every file the manifest claims and confirms the chain back to the prior day, entirely offline. See also Receipts for the resolved-call ledger this mechanism protects.
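For readers who want to see the shape of the check before running it, here is a rough sketch of what such a verification involves. It is not the contents of scripts/verify_snapshot.py: the function name `verify` is hypothetical, and it assumes a manifest is hashed as sorted-key JSON, which the repo's script may do differently.

```python
import hashlib
import json
from pathlib import Path

def verify(manifest, previous_manifest, data_dir):
    """Return a list of problems; an empty list means the snapshot checks out."""
    problems = []
    # 1. Re-hash every file the manifest claims.
    for rel_path, expected in manifest["files"].items():
        path = Path(data_dir) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
        elif hashlib.sha256(path.read_bytes()).hexdigest() != expected:
            problems.append(f"hash mismatch: {rel_path}")
    # 2. Confirm the link back to the prior day's manifest, if there is one.
    if previous_manifest is not None:
        canonical = json.dumps(previous_manifest, sort_keys=True).encode("utf-8")
        if hashlib.sha256(canonical).hexdigest() != manifest.get("manifest_sha256_of_yesterday"):
            problems.append("chain link does not match previous manifest")
    return problems
```

Both steps only read local files, which is why the check can run entirely offline.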