How we do this poorly, on purpose.
A guide to what Standard Poorly measures, what those measurements mean, and the long list of reasons you should not bet your house on any of it.
How this works
For each indicator we publish, we compute the day-over-day change in a publicly observable signal: an App Store rank, a Google Trends index, a weather station reading, the closing price of a meme stock. We z-score that change against a 180-day rolling window. When the absolute z-score reaches 2.5, we mark the day as a “fire.”
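A minimal sketch of the fire rule described above, not the production code. The function name `fire_days` is hypothetical, the window length and threshold are passed in as parameters, and two details are assumptions: the rolling window excludes the day being scored, and a z-score exactly at the threshold counts as a fire.

```python
from statistics import mean, stdev


def fire_days(values, window, threshold):
    """Return indices into `values` whose day-over-day change is a fire."""
    # changes[i] is the move from day i to day i + 1.
    changes = [b - a for a, b in zip(values, values[1:])]
    fires = []
    for i in range(window, len(changes)):
        history = changes[i - window:i]  # trailing window, excludes today (assumption)
        sd = stdev(history)
        if sd == 0:
            continue  # flat history: the z-score is undefined, so skip the day
        z = (changes[i] - mean(history)) / sd
        if abs(z) >= threshold:  # ">=" rather than ">" is an assumption
            fires.append(i + 1)  # convert back to an index into `values`
    return fires


# Toy series: ten small alternating moves, then one large jump on the last day.
print(fire_days([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 20], window=10, threshold=2.0))
```

The toy call uses a 10-day window only so the example fits on one line; swap in whatever window and threshold the indicator actually uses.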
We then check what happened to the SPY at four forward horizons — 1, 5, 21, and 63 trading days — and report two numbers per horizon: the hit rate (the fraction of fires followed by a positive forward return) and the p-value for that hit rate against a null of 50%.
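The two numbers per horizon can be sketched as follows. This is an illustration under stated assumptions: the helper names are hypothetical, forward returns are simple price ratios, fires too close to the end of the series are dropped, and the p-value is an exact two-sided binomial test against 50% (the text does not say whether the site uses a one-sided or two-sided test).

```python
from math import comb

HORIZONS = (1, 5, 21, 63)  # trading days, as listed above


def forward_returns(prices, fire_idx, horizon):
    """Simple forward return after each fire that has enough future data."""
    return [prices[i + horizon] / prices[i] - 1
            for i in fire_idx if i + horizon < len(prices)]


def hit_rate_and_p(returns):
    """Hit rate and exact two-sided binomial p-value against a 50% null."""
    n = len(returns)
    if n == 0:
        raise ValueError("no fires with forward data at this horizon")
    hits = sum(1 for r in returns if r > 0)
    # Under p = 0.5 the distribution is symmetric, so double the larger tail.
    k = max(hits, n - hits)
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return hits / n, min(1.0, 2 * tail)


# Eight positive forward returns out of ten fires.
print(hit_rate_and_p([0.01] * 8 + [-0.01] * 2))  # (0.8, 0.109375)
```

Note how an 80% hit rate on ten fires still fails a 5% threshold, which is why the sample size matters as much as the hit rate.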
The horizons are not opinions. We picked them because they correspond to “tomorrow,” “this week,” “this month,” and “this quarter,” and because reporting only one of them would let us cherry-pick. By showing all four, we let you see the cases, which are quite common, where a signal works at one horizon and falls apart at the others. Those cases are the most honest thing about this website.
The 1–10 signal grade
Every indicator on the site carries a single number: 1 to 10. It exists because readers ask one question more than any other — “the F&G card says 73%, the Wikipedia card says 66% — which should I trust more?” The grade collapses the four-axis “is this real?” problem into one comparable score.
The grade is percentile within our universe, not an absolute rating. A 10/10 means “top 10% of all indicators we publish, ranked by composite score.” A 5/10 is around the median. A 1/10 is the bottom decile. This is self-calibrating: as we add more indicators, the distribution rebalances so a 10 always means top decile, not a number that drifts over time.
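One way to turn composite scores into decile grades is sketched below. The function name is hypothetical and tie handling is an assumption: ties are broken by position in the input list, which the text above does not specify.

```python
def percentile_grades(scores):
    """Map composite scores to 1-10 grades by decile within the universe."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # lowest score first
    grades = [0] * n
    for rank, i in enumerate(order, start=1):
        # Integer ceiling of 10 * rank / n: the top 10% of ranks map to 10.
        grades[i] = (10 * rank + n - 1) // n
    return grades


print(percentile_grades([5, 50, 20, 90, 10, 70, 30, 60, 40, 80]))
```

Because the grade depends only on rank, adding indicators reshuffles the grades without changing what a 10 means.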
The composite score is built from five weighted axes. Each is capped so no single factor can dominate:
- Effect size (cap +30): the average forward return on fire days, signed by signal direction. A 3% average earns the full bonus.
- Frequency edge (cap +15): how far the directional hit-rate sits from a 50% coin-flip. A 65% hit rate earns the cap.
- Sample size (cap +15): log-scaled. n=100 earns +10 points; n=1,000 earns the +15 cap.
- Validation status: +25 points for out-of-sample-validated, +10 for in-sample-only, −25 for failed-OOS.
- P-value: +15 if p<.01, +10 if p<.05, 0 if p<.10, −10 if p≥.10.
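The five axes above can be combined roughly as follows. This is a sketch, not the published algorithm. Assumptions: effect size and frequency edge scale linearly up to their caps, a negative signed average return earns zero rather than a penalty, and sample size uses 5 × log10(n), which is simply the curve that passes through both stated points (n=100 gives 10, n=1,000 gives 15). The validation labels are hypothetical strings.

```python
from math import log10

VALIDATION_POINTS = {"oos_validated": 25, "in_sample_only": 10, "failed_oos": -25}


def composite_score(avg_return, hit_rate, n, validation, p_value):
    """Sum the five capped axes; `avg_return` is already signed by direction."""
    effect = 30.0 * min(1.0, max(avg_return, 0.0) / 0.03)  # 3% earns the cap
    edge = 15.0 * min(1.0, abs(hit_rate - 0.5) / 0.15)     # 65% earns the cap
    sample = min(15.0, 5.0 * log10(n)) if n >= 1 else 0.0   # inferred log scale
    if p_value < 0.01:
        p_points = 15
    elif p_value < 0.05:
        p_points = 10
    elif p_value < 0.10:
        p_points = 0
    else:
        p_points = -10
    return effect + edge + sample + VALIDATION_POINTS[validation] + p_points


# A signal that maxes every axis, and a middling one.
print(composite_score(0.03, 0.65, 1000, "oos_validated", 0.005))
print(composite_score(0.015, 0.55, 100, "in_sample_only", 0.2))
```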
We also enforce floors — hard evidence bars an indicator must clear to land in the top of the scale. Without floors, adding 400 weak indicators would lift previously-MODERATE ones to STRONG by sorting alone. Floors fix this:
- Grade ≥ 9 requires p<.01 AND n≥500.
- Grade ≥ 7 requires out-of-sample validation.
- Grade ≥ 6 requires n≥100.
- Grade ≥ 5 requires not-failed OOS.
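The floors can be read as caps on the percentile grade. The sketch below assumes that an indicator missing a floor is held just below it (for example, capped at 8 if it misses the bar for 9) rather than being pushed down further. The function name and validation labels are hypothetical.

```python
def apply_floors(grade, p_value, n, validation):
    """Cap a percentile grade at the highest level its evidence supports."""
    cap = 10
    if not (p_value < 0.01 and n >= 500):
        cap = min(cap, 8)  # grade >= 9 requires p < .01 AND n >= 500
    if validation != "oos_validated":
        cap = min(cap, 6)  # grade >= 7 requires out-of-sample validation
    if n < 100:
        cap = min(cap, 5)  # grade >= 6 requires n >= 100
    if validation == "failed_oos":
        cap = min(cap, 4)  # grade >= 5 requires not-failed OOS
    return min(grade, cap)


# A top-decile indicator with p = .03 cannot rise above 8.
print(apply_floors(10, p_value=0.03, n=1000, validation="oos_validated"))
```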
Tier labels map onto the 1–10 scale: STRONG (8–10), MODERATE (6–7), WEAK (4–5), NOISE (1–3), and TOO NEW (pending — not yet enough data to score).
Honest caveat: the weight choices and floor cutoffs are editorial judgments, not derived from a back-test. We picked them to encode “OOS-validation matters most, statistical significance second, raw effect size third.” A future version will back-test these weights against actual signal performance.
View modes (skin × lexicon)
Two independent toggles in the masthead let you pick how the site presents itself. They cross to give four total views:
- Skin (icon ◧/▮ or keyboard T): Almanac renders as a cool-stone broadsheet newspaper. Terminal renders as a Bloomberg-green CRT.
- Lexicon (icon A/σ or keyboard L): Plain uses everyday English (“same direction,” “STRONG signal”). Technical uses the underlying statistics jargon (“concordant,” “8/10 · OOS✓ · p<.01,” raw r/p/n readouts).
Skin and lexicon are orthogonal. Some readers want the newspaper look but technical copy. Some want the Bloomberg look with plain English. Your preferences persist across pages and sessions.
Why σ=2.5, window=180d (and how we picked)
A “fire day” means the indicator's value moved at least 2.5 standard deviations from its 180-day rolling average. We didn't pick those numbers because they're conventional — we picked them because we tested.
Two related-but-distinct definitions live on the site. The validated “fire” is the one above (σ ≥ 2.5 vs 180-day baseline), used by the historical engine that powers every indicator's OOS scorecard. A separate short-term “pulse” uses a looser threshold (σ ≥ 2.0 vs 30-day baseline) and powers the live /fires watchlist and the homepage cofiring banner. We use both intentionally: the validated fire is the audited record; the pulse is the live-engine read. Treat pulse hits as “worth watching today,” not as the OOS-validated signal the methodology section describes.
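The difference between the two definitions is easiest to see side by side. In this sketch the `FireRule` class and `is_firing` function are hypothetical names; only the σ and window values come from the text. The baseline excludes today's value, which is an assumption.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class FireRule:
    name: str
    sigma: float  # threshold in standard deviations
    window: int   # baseline length in days


VALIDATED = FireRule("validated fire", sigma=2.5, window=180)
PULSE = FireRule("pulse", sigma=2.0, window=30)


def is_firing(values, rule):
    """True if the latest value sits at least rule.sigma SDs from its baseline."""
    if len(values) <= rule.window:
        return False  # not enough history to form the baseline
    baseline = values[-rule.window - 1:-1]  # excludes today (assumption)
    sd = stdev(baseline)
    if sd == 0:
        return False
    return abs(values[-1] - mean(baseline)) / sd >= rule.sigma


# 150 volatile days, then 30 calm days, then a moderate move today.
history = [10, -10] * 75 + [1, -1] * 15 + [5]
print(is_firing(history, PULSE), is_firing(history, VALIDATED))
```

In the toy series the move looks large against the calm last month but ordinary against the full 180 days, so the pulse trips and the validated fire does not.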
Our AutoResearch loop runs deterministic parameter sweeps across multiple indicator categories, with held-out validation slices the optimizer never sees. Across five rounds of sweeps covering the vol panel, an extended 1990-2026 history, edge-case σ/window combinations, and cross-category robustness on ~190 indicators (chaos, macro, weather, technical), the robust interior optimum we landed on is σ=2.5 / window=180d.
The headline finding from the cross-category run: every category we tested prefers a 180-day window, and three of four prefer σ=2.5 (macro is slightly happier with a softer threshold). The previous default we inherited (σ=2.0 / window=60d) was beaten on every panel — the original search space was just too narrow to see the better corner of the surface.
We don't publish the per-round sweep grids, the corner-edge lift numbers, or the category-by-category validation deltas — those would hand a competitor the entire AutoResearch playbook. The consolidated findings (and the surface heatmaps) are part of the Analyst-tier research bundle.
The discipline that earned the change: AutoResearch runs deterministically in Python (zero LLM tokens during the loop), tests against held-out validation periods the optimizer never sees, and re-runs on a fixed cadence. If you find a better setting, the rubric is open — we'll re-run and switch when evidence justifies it.
The multiple-testing problem
We are currently testing 230 indicators across 4 horizons — 920 hypotheses. At a conventional 5% significance threshold, we should expect ~46 of those to look “significant” by chance alone, even if every indicator were pure noise.
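The arithmetic behind that estimate, as a small sketch (the function name is hypothetical). The expected count is just the number of tests times the threshold, and it holds whether or not the tests are independent.

```python
def expected_false_positives(n_indicators, n_horizons, alpha):
    """Return (number of tests, expected count significant by chance alone)."""
    n_tests = n_indicators * n_horizons
    return n_tests, n_tests * alpha


n_tests, expected = expected_false_positives(230, 4, 0.05)
print(f"{n_tests} tests, about {expected:.0f} expected false positives")
```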
This is the problem that makes most quantitative blogs irresponsible. They run a thousand backtests, publish the seventeen that worked, and never mention the other 983. Standard Poorly publishes all 920. The site is, in a real sense, a controlled demonstration of why p-hacking works.
“Forgive me, for I have multiple-tested.”
If you want to use any of these signals seriously, the appropriate adjustment is Bonferroni: divide your significance threshold by the number of hypotheses you considered before picking one. Our published per-test p-values are not family-wise corrected — they are the raw numbers. We show them raw because correcting them in advance would hide what we are actually doing, which is testing a lot of things in public.
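A sketch of the adjustment, using hypothetical helper names. With 920 hypotheses and a 5% family-wise threshold, a raw p-value has to fall below roughly 0.00005 to survive.

```python
def bonferroni_threshold(alpha, n_hypotheses):
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_hypotheses


def survives_bonferroni(p_value, alpha, n_hypotheses):
    """True if a raw p-value clears the corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_hypotheses)


print(f"{bonferroni_threshold(0.05, 920):.6f}")
print(survives_bonferroni(0.01, 0.05, 920))
```

Bonferroni is conservative when tests are correlated, as horizons of the same indicator are, so treat a failure here as a reason for caution rather than proof of noise.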
Walk-forward out-of-sample
An in-sample p-value tells you the data fit; it does not tell you the signal will continue. To check for the latter, we hold out the last 90 trading days of every series, fit the rule on the rest, and then evaluate the held-out window with the rule frozen. If the hit rate in the holdout is consistent with the in-sample rate, we mark the indicator OOS validated. If it isn't, we mark it not significant and we keep showing it anyway.
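The holdout check can be sketched as below. The text does not define “consistent with,” so this example makes a loud assumption: the holdout passes if an exact two-sided binomial test cannot reject the in-sample hit rate at 5%. All function names are hypothetical.

```python
from math import comb

HOLDOUT_DAYS = 90  # last 90 trading days, as described above


def split_holdout(series, holdout=HOLDOUT_DAYS):
    """Split a series into the fitting portion and the frozen holdout."""
    return series[:-holdout], series[-holdout:]


def binom_two_sided_p(k, n, p):
    """Exact two-sided p-value: total mass of outcomes no likelier than k."""
    pmf = [comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-7)))


def oos_validated(in_hits, in_n, out_hits, out_n, alpha=0.05):
    """Assumed rule: pass unless the holdout rejects the in-sample hit rate."""
    in_rate = in_hits / in_n
    return binom_two_sided_p(out_hits, out_n, in_rate) >= alpha


fit, held_out = split_holdout(list(range(300)))
print(len(fit), len(held_out))
print(oos_validated(70, 100, 7, 10), oos_validated(70, 100, 1, 10))
```

With only a handful of fires in a 90-day window, this kind of test has little power, so a pass means “not contradicted,” not “confirmed.”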
We do not run OOS continuously — only on a quarterly cadence, because too-frequent re-validation is its own kind of multiple testing. The badge on every card shows you when the last OOS test ran and whether the signal survived.
Why we publish spurious correlations anyway
The honest answer is that the joke is the point. There is a real intellectual tradition — Tyler Vigen's Spurious Correlations, the stock-market-and-Super-Bowl people, the Hindenburg Omen — of finding statistical relationships that look meaningful and are not. Standard Poorly is the first publication to do this with full disclosure of the meta-statistics that make the genre nonsense.
So when you read on the home page that “BetterHelp downloads spiked 40%, historically a 72% hit-rate signal for SPY gains over the next five trading days,” the correct response is not “I should buy SPY.” It is “interesting; how many other tests is Standard Poorly running today, and is 72% even surprising given that?” The answer is in the disclaimer at the bottom of every card. We did the homework. You can audit it.
“Nobody asked. We ledger anyway.”
Tier comparison
Press tier is free for a limited number of verified working journalists — same feature surface as Trader plus raw API access with a citation requirement. Apply here.
Words we use
We use two voices on this site: Almanac (plain English, newspaper-style, our default) and Terminal (technical, for readers who toggle into Bloomberg-mode). Some terms get translated, some only show up on the technical side. Here's the translation table.
| Almanac voice (default) | Terminal voice | What it actually measures |
|---|---|---|
| How unusual today is | z-score | How many standard deviations today’s value sits above or below the recent rolling average. \|z\|=2.5 means “rarer than 99% of recent days.” |
| How extreme it has to be to count | sigma threshold (σ) | The cutoff at which we say a value is “unusual enough to fire.” Currently 2.5σ. |
| What we compare today to | baseline window | How many trading days back we average to get the “normal” value. Currently 180. |
| How far ahead we look | forward horizon | How many trading days into the future we measure SPY’s move after a fire. We track 1, 5, 21, 63, 126, and 252 days. |
| How often this turns out right | hit rate | Of all the times this signal fired, the percentage where the market moved in the predicted direction. |
| How strong the call was | conviction (claim strength) | For pundit predictions only — a 0–100 score for how forcefully the speaker stated the view. Pundit Index pages explain it in detail. |
| Two signals firing around the same time | cofire / co-fire | When indicator A and indicator B both fire within a few trading days of each other. |
| When these both fire | stacked / joint signal | A pair where both indicators must fire together to count as a signal. The combined predictive power is often stronger than either alone. |
| Pointed one way before, the other way lately | sign flip / regime change | A signal that was bullish in older data but bearish recently (or vice versa). Could be a real shift in how markets relate, or just noise. We mark these so you can decide. |
| Could be luck | low confidence / underpowered | A signal with too few examples to distinguish real pattern from coincidence. Shown with caution. |
| Re-rolled the dice 1000 times to check | bootstrap confidence interval | A statistical robustness test. We resample the data 1000 times to see if the hit rate would still hold under different sample compositions. |
| We adjusted for the fact that we tested a lot of stuff | multiple-testing correction | When you run thousands of tests, some look significant by chance. We track and disclose the total test count so you can adjust your priors. See “The multiple-testing problem.” |
| Held up out of sample | OOS validated | The signal’s hit rate held up on data we kept hidden during model fitting. See “Walk-forward out-of-sample.” |
| Reliable pair | STABLE-BULL / STABLE-BEAR | A two-signal combination that works the same way both in our 2010–2019 records AND in held-out 2020+ data. The trustworthy ones. |
| Flipped recently | SIGN-FLIP | A pair that pointed one direction historically but the opposite recently. We show with a warning. |
| Could be luck | LOW-CONF | A pair with fewer than 20 examples in either time period. Pattern is suggestive but easily explained by chance. |
| Too new to judge | INSUFFICIENT | A pair with fewer than 8 fires in either period. We show it for completeness but tag it clearly. |
Toggle between Almanac and Terminal voices using the L key on any page, or the lexicon switcher in the top-right corner. The Almanac voice prioritizes plain language and storytelling. The Terminal voice prioritizes brevity and the technical names that finance professionals expect.
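For the “re-rolled the dice 1000 times” row in the table above, a percentile bootstrap is one common way to do it. This is a sketch under assumptions: the function name, the 95% level, and the fixed seed are illustrative choices, and the site may use a different bootstrap variant.

```python
import random


def bootstrap_hit_rate_ci(outcomes, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for a hit rate; outcomes are 1 (hit) or 0."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    n = len(outcomes)
    # Resample the fires with replacement and record each resample's hit rate.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo_idx = round((1 - level) / 2 * n_resamples)
    hi_idx = round((1 + level) / 2 * n_resamples) - 1
    return rates[lo_idx], rates[hi_idx]


# 70 hits out of 100 fires.
low, high = bootstrap_hit_rate_ci([1] * 70 + [0] * 30)
print(f"hit rate 0.70, interval {low:.2f} to {high:.2f}")
```

If the interval still includes 50%, the signal belongs in the “could be luck” bucket regardless of its headline hit rate.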
What we will not do
We will not sell trading signals. We will not let sponsorship influence which indicators surface or how they rank — the algorithm is the algorithm, and it is published. We will not offer “premium” indicators that look better because we cherry-picked them; the surfaceable threshold is the same for every signal. We will not retroactively edit a published p-value when it embarrasses us; we will publish a correction with the original number visible. We will not sell your data. We will not, ever, charge for the disclaimer.
We will run clearly-disclosed brokerage affiliate links and newsletter sponsorships. They are how a free public dashboard gets paid for. Every affiliate link is labeled in plain text. Every sponsored newsletter section is labeled Sponsored by [name]. Sponsorship buys exposure in clearly-marked editorial slots; it does not buy ranking, surfacing, or methodology changes. If you ever spot a correlation that looks like it shouldn’t be on the front page, let us know — we want the audit.