Is the Perfect Shuffle a Myth? Part 5 - Validating the Baseline

25 June

Written By Paradis

A simulator is not evidence just because it runs.

That is the point of this part of the project. Part 4 put the machine into the simulation. The casino blackjack engine now exists, the card-source abstraction works, physical card identity is tracked, the discard rack is ordered, and the One2Six-style source can run without obvious state corruption.

That is necessary progress, but it is not enough.

Before I can interpret anything produced by the One2Six model, I need to know that the measuring instrument is not lying. If an IID source produces strange recurrence patterns, the shuffler result is meaningless. If blackjack frequency is calculated using the wrong denominator, the output is noise with a table around it. If a push breaks a win streak mechanically, the streak distribution is distorted before the shuffler is even tested.

The baseline is not administrative work. It is the first real test of the project.

This article is about that gate.

Why the baseline matters

The central question of this series is whether a continuous shuffler can preserve measurable short-horizon structure in the card stream.

That question cannot be answered directly by running a One2Six-style model and looking at profit. Profit is too far downstream. It depends on the card source, the game rules, the strategy policy, the discard timing, the settlement logic, the wager accounting, and variance. A positive result over 10,000 rounds might be interesting enough to investigate, but it is not evidence of an edge.

The first question is simpler:

Does the measurement framework behave correctly when the answer is already known?

That is the purpose of the IID baseline.

The IID source is deliberately unrealistic. A casino blackjack shoe is not an IID card generator. A manual shoe deals without replacement. A continuous shuffler is a physical state machine. IID ignores all of that and produces an independent random card on every draw.

That is exactly why it is useful.

With IID, the expected behaviour is clean. Rank frequencies should converge to uniform, and so should suit frequencies. The ace rate should converge to 1/13 and the ten-value rate to 4/13. Target-card recurrence should follow a geometric distribution. The blackjack rate should converge to roughly 4.73%. If those checks fail, the problem is not the One2Six. The problem is the simulator, the metrics, or the assumptions.

The IID source is not the final comparator. It is the calibration source.
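To make "an independent random card on every draw" concrete, here is a minimal sketch of such a source. The class name IidSketchSource and its draw method are hypothetical stand-ins for illustration, not the project's IidRandomCardSource, and the card format simply mirrors the T:S notation used later in this article.

```python
import random
from collections import Counter

RANKS = "A23456789TJQK"   # 13 ranks, T = ten
SUITS = "SHDC"            # 4 suits


class IidSketchSource:
    """Hypothetical stand-in for an IID card source (not the project's class)."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)

    def draw(self):
        # Each draw is independent and uniform over the 52 rank/suit symbols.
        # Nothing is removed from a shoe, so there is no depletion and no memory.
        return self._rng.choice(RANKS) + ":" + self._rng.choice(SUITS)


source = IidSketchSource(seed=42)
cards = [source.draw() for _ in range(52_000)]
rank_counts = Counter(card[0] for card in cards)

# Under IID each rank should land near 52_000 / 13 = 4_000.
for rank in RANKS:
    print(rank, rank_counts[rank])
```

Because nothing is ever removed, the same symbol can appear twice in a row. That is impossible for a single physical card and is exactly what makes this source a calibration tool rather than a casino model.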

What is being validated

There are two different validation layers.

The first is source-level validation. This asks whether the card source itself is producing the expected stream. It does not care who wins the hand. It does not care about doubles, splits, or bankroll. It cares about the cards.

The second is game-level validation. This asks whether the casino blackjack engine, strategy policy, settlement logic, and result tracker produce plausible outcomes when driven by a known source.

Both layers matter.

A source can be correct while the game engine is wrong. A game engine can appear plausible while the source-level recurrence is broken. A result tracker can calculate a perfectly formatted edge number using the wrong denominator. All of these failure modes are possible, and none of them can be waved away if the project is meant to be serious.

The current experiment framework separates those concerns:

Layer	Question
Source-level metrics	Does the raw card stream behave as expected?
Game-level metrics	Do blackjack outcomes, wagers, streaks, and profit paths behave plausibly?
Plot generation	Do recurrence, streak, outcome, and profit distributions look structurally sane?
Baseline comparison	Does IID behave like IID before manual shoe or One2Six is interpreted?

This is why the experiment framework lives outside the core engine. The engine should run the game. The experiment layer should measure the output. Keeping those layers separate protects the model from analysis code leaking into game mechanics.

The denominator problem

One of the early corrections in the project was a denominator issue around blackjack frequency.

A raw count like this can look alarming:

465 blackjacks / $100,000 initial wagered = 0.465%

But that is not a blackjack rate.

That calculation divides an event count by dollars wagered. It produces a number, but the number does not mean what it appears to mean. The correct denominator is the number of initial hands.

For 10,000 initial hands, the calculation is:

465 blackjacks / 10,000 initial hands = 4.65%

That is completely plausible.

For an IID rank/suit card source, the probability of a two-card player natural is:

P(blackjack) = P(A then ten-value) + P(ten-value then A)
             = 2 * (4 / 52) * (16 / 52)
             = 128 / 2704
             ≈ 4.7337%

That is the relevant baseline for the IID source.
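The arithmetic can be checked exactly with rational numbers. This is a small sketch using only the standard library; the $10 bet per initial hand is an assumption implied by the $100,000 and 10,000-hand figures above.

```python
from fractions import Fraction

p_ace = Fraction(4, 52)
p_ten_value = Fraction(16, 52)   # T, J, Q, K in four suits

# Two orderings: ace then ten-value, or ten-value then ace.
p_blackjack = 2 * p_ace * p_ten_value

print(p_blackjack)                  # 8/169
print(f"{float(p_blackjack):.4%}")  # 4.7337%

# The same event count against two different denominators.
blackjacks = 465
initial_hands = 10_000
initial_wagered = 100_000  # dollars, assuming $10 per initial hand

print(f"{blackjacks / initial_hands:.2%}")    # 4.65%, a rate per hand
print(f"{blackjacks / initial_wagered:.3%}")  # 0.465%, events per dollar, not a rate
```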

This looks like a small accounting correction, but it is not small. If the denominator is wrong, the project can manufacture false anomalies. A blackjack rate that appears to be 0.5% would suggest a catastrophic card-generation bug. The same count with the correct denominator is ordinary variance around the expected IID rate.

This is why result tracking now separates:

initial_wagered
action_wagered
total_wagered
net_profit
edge_per_initial_wager
edge_per_total_wager

Different questions require different denominators. Blackjack frequency is an event-per-hand statistic. House edge can be expressed per initial wager or per total wager. Doubles and splits affect action wager. Net profit belongs in dollars. Mixing those categories produces impressive-looking nonsense.
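One way to keep those denominators from mixing is to store only the raw totals and derive every ratio from a named field. The sketch below is an assumption about shape, not the project's result tracker: the class name ResultTotals is hypothetical, and the action_wagered and net_profit values in the example are made-up illustrative figures.

```python
from dataclasses import dataclass


@dataclass
class ResultTotals:
    """Hypothetical container; field names follow the list above."""
    initial_hands: int
    player_blackjacks: int
    initial_wagered: float
    action_wagered: float
    net_profit: float

    @property
    def total_wagered(self):
        # Base bets plus the extra money put out on doubles and splits.
        return self.initial_wagered + self.action_wagered

    @property
    def blackjack_rate(self):
        # Events per initial hand, never per dollar.
        return self.player_blackjacks / self.initial_hands

    @property
    def edge_per_initial_wager(self):
        return self.net_profit / self.initial_wagered

    @property
    def edge_per_total_wager(self):
        return self.net_profit / self.total_wagered


# Illustrative figures only: action_wagered and net_profit are invented here.
totals = ResultTotals(
    initial_hands=10_000,
    player_blackjacks=465,
    initial_wagered=100_000,
    action_wagered=12_000,
    net_profit=-560,
)
print(f"{totals.blackjack_rate:.2%}")          # 4.65%
print(f"{totals.edge_per_initial_wager:.2%}")  # -0.56%
print(f"{totals.edge_per_total_wager:.2%}")    # -0.50%
```

Each ratio names its denominator in the property itself, so a count divided by dollars has nowhere to hide.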

Recurrence is geometric

The project is ultimately about recurrence, latency, and return time. That means the baseline recurrence model has to be correct.

A common mistake is to reach for a Poisson model too early. Poisson is useful for approximating the number of events in a fixed window under certain conditions. It is not the direct model for the number of draws until a target appears again.

Inter-arrival time is geometric.

If a target has probability p on each independent draw, then the waiting time until the next appearance follows a geometric distribution. For a specific rank/suit target under an IID 52-card symbol model, such as T:S, the target probability is:

p = 1 / 52

For a rank-level target, the probability depends on how the target is defined. A single rank in a 13-rank model, such as five, has:

p = 1 / 13

A rank class covers more symbols. If the target is any ten-value card (T, J, Q, K), then p = 4 / 13.

That distinction matters because the One2Six question is not merely “how often does a card type appear?” It is also “how long between appearances?” and later, for physical-card sources, “how long until the same physical card reappears after being discarded?”

IID cannot answer the physical-card question because IID has no physical memory. It can answer the symbol-recurrence question. That is enough for the first baseline.

The correct order is:

1. Check that IID symbol recurrence follows the expected geometric behaviour.
2. Check that manual-shoe physical-card recurrence behaves like a finite shoe with cut-card reshuffle.
3. Check that One2Six physical-card recurrence behaves like the configured shuffler model.
4. Compare the differences only after the first three layers are trusted.

Skipping the first layer would be a mistake. If the geometric overlay does not match IID, the later One2Six plots are not evidence. They are diagnostics for a broken measurement framework.
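The geometric overlay for the first layer can be sketched in a few lines. This is a minimal illustration, assuming a rank-only IID stream; the helper names geometric_pmf and inter_arrival_gaps are hypothetical and not taken from the project. For a geometric waiting time, P(gap = k) = (1 - p)^(k - 1) * p and the mean gap is 1 / p.

```python
import random
from collections import Counter


def geometric_pmf(k, p):
    """P(next appearance happens exactly k draws later), for k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p


def inter_arrival_gaps(stream, target):
    """Gaps, in draws, between consecutive appearances of target."""
    gaps = []
    last_seen = None
    for index, symbol in enumerate(stream):
        if symbol == target:
            if last_seen is not None:
                gaps.append(index - last_seen)
            last_seen = index
    return gaps


rng = random.Random(42)
p = 1 / 13                      # a specific rank in a 13-rank model
ranks = "A23456789TJQK"
stream = [rng.choice(ranks) for _ in range(200_000)]

gaps = inter_arrival_gaps(stream, "5")
observed = Counter(gaps)
mean_gap = sum(gaps) / len(gaps)

print(f"mean gap {mean_gap:.2f} draws, expected {1 / p:.0f}")
# Observed share of each gap length next to the geometric prediction.
for k in range(1, 6):
    print(k, f"{observed[k] / len(gaps):.4f}", f"{geometric_pmf(k, p):.4f}")
```

If the observed column drifts away from the geometric column under IID, the recurrence metric is wrong, not the cards.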

Source-level metrics

The source-level experiment draws cards directly from the IID source before the blackjack engine is involved.

That matters because it isolates the card generator from the game. If the raw card stream fails, there is no point debugging blackjack results.

The current source-level metrics include:

Metric	Purpose
Rank counts	Checks rank uniformity
Suit counts	Checks suit uniformity
Rank/suit counts	Checks symbol-level uniformity
Ace rate	Validates ace frequency
Ten-value rate	Validates ten-value card frequency
Low-card rate	Validates Hi-Lo low-card frequency
Neutral-card rate	Validates Hi-Lo neutral-card frequency
High-card rate	Validates Hi-Lo high-card frequency
Hi-Lo mean and variance	Checks the card stream under a standard counting transform
Target-card recurrence	Checks inter-arrival times for a specific rank/suit target
Rank-level recurrence	Checks recurrence at a broader symbol level
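Several of these metrics have closed-form targets under IID, which is what makes them usable as checks. The sketch below assumes a rank-only stream and the standard Hi-Lo tags (2 to 6 count +1, 7 to 9 count 0, ten-values and aces count -1). Under those assumptions the low, neutral, and high rates are 5/13, 3/13, and 5/13, the Hi-Lo mean is 0, and the Hi-Lo variance is 10/13. The variable names are illustrative, not the project's.

```python
import random

RANKS = "A23456789TJQK"

# Standard Hi-Lo tags: 2-6 count +1, 7-9 count 0, ten-values and aces count -1.
HI_LO = {r: +1 for r in "23456"}
HI_LO.update({r: 0 for r in "789"})
HI_LO.update({r: -1 for r in "TJQKA"})

rng = random.Random(7)
draws = [rng.choice(RANKS) for _ in range(130_000)]
tags = [HI_LO[r] for r in draws]
n = len(draws)

ace_rate = draws.count("A") / n
ten_value_rate = sum(r in "TJQK" for r in draws) / n
low_rate = tags.count(+1) / n
neutral_rate = tags.count(0) / n
high_rate = tags.count(-1) / n

# Population mean and variance of the Hi-Lo tag stream.
hi_lo_mean = sum(tags) / n
hi_lo_var = sum((t - hi_lo_mean) ** 2 for t in tags) / n

print(f"ace        {ace_rate:.4f}  expected {1 / 13:.4f}")
print(f"ten-value  {ten_value_rate:.4f}  expected {4 / 13:.4f}")
print(f"low        {low_rate:.4f}  expected {5 / 13:.4f}")
print(f"neutral    {neutral_rate:.4f}  expected {3 / 13:.4f}")
print(f"high       {high_rate:.4f}  expected {5 / 13:.4f}")
print(f"Hi-Lo mean {hi_lo_mean:+.4f}  expected 0")
print(f"Hi-Lo var  {hi_lo_var:.4f}  expected {10 / 13:.4f}")
```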

The first targets currently include:

specific card targets:
    T:S
    5:S

rank targets:
    T
    5

The specific choices are not sacred. They are probes. A ten and a five are useful because they sit on opposite sides of common blackjack counting systems. A ten-value card and a low card should not show unexpected recurrence behaviour under IID.

If they do, something is wrong.

Game-level metrics

Once the source-level checks are sane, the next layer is the casino blackjack game.

The game-level experiment runs the blackjack engine with the IID source and records:

Metric	Reason
Rounds	Basic simulation scale
Initial hands	Correct denominator for natural blackjack and hand-level event rates
Wins, losses, pushes	Outcome distribution
Player blackjacks	Natural blackjack sanity check
Splits	Legal-action and strategy sanity check
Doubles	Legal-action and strategy sanity check
Busts	Hand-resolution sanity check
Initial wagered	Base exposure
Action wagered	Additional double/split exposure
Total wagered	Full amount at risk
Net profit	Real dollar outcome
Edge per initial wager	Standard headline edge metric
Edge per total wager	Useful accounting-normalised edge metric
Cumulative profit path	Variance and drift sanity check
Win/loss streak distributions	Outcome clustering check

This is not meant to prove the exact house edge of the implemented casino rule set. The current strategy is still an approximate H17 (dealer hits soft 17) multi-deck strategy constrained by the implemented legal actions. A solver-generated exact strategy can come later.

At this stage, the goal is more basic:

Does the game engine behave plausibly under a source whose distribution is known?

If the answer is no, the One2Six analysis must wait.

Pushes and streaks

Streaks are easy to define badly.

For this project, pushes do not break streaks.

W W P W -> win streak of 3
L L P L -> loss streak of 3
W P L   -> win streak of 1, loss streak of 1

This definition matches bankroll direction. A push neither increases nor decreases the bankroll. Treating it as a streak breaker would artificially shorten win and loss streaks and distort the distribution being measured.
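The rule is simple enough to state as code. This is a minimal sketch of the definition, assuming outcomes arrive as a sequence of 'W', 'L', and 'P'; the function name streak_lengths is hypothetical and is not the project's tracker.

```python
from collections import Counter


def streak_lengths(outcomes):
    """Return (win_streaks, loss_streaks) for a sequence of 'W', 'L', 'P'.

    Pushes are skipped: they neither extend nor break the current streak.
    """
    win_streaks, loss_streaks = [], []
    current, length = None, 0
    for outcome in outcomes:
        if outcome == "P":
            continue
        if outcome == current:
            length += 1
            continue
        # Direction changed: close the previous streak, if any.
        if current == "W":
            win_streaks.append(length)
        elif current == "L":
            loss_streaks.append(length)
        current, length = outcome, 1
    # Close the streak still open at the end of the sequence.
    if current == "W":
        win_streaks.append(length)
    elif current == "L":
        loss_streaks.append(length)
    return win_streaks, loss_streaks


print(streak_lengths("WWPW"))  # ([3], [])
print(streak_lengths("LLPL"))  # ([], [3])
print(streak_lengths("WPL"))   # ([1], [1])

# Signed distribution: wins on the positive axis, losses on the negative axis.
wins, losses = streak_lengths("WWPWLLPLWPL")
signed = Counter(wins) + Counter(-length for length in losses)
print(sorted(signed.items()))  # [(-3, 1), (-1, 1), (1, 1), (3, 1)]
```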

This matters because one of the later questions is whether a One2Six-style source produces streak patterns that differ from IID or manual shoe baselines. If the streak metric is defined incorrectly, the comparison is polluted before the shuffler is tested.

The system now tracks current streaks, maximum streaks, full win-streak and loss-streak distributions, and signed streak distributions where losses appear on the negative x-axis and wins on the positive x-axis.

That gives the project a way to compare clustering without pretending that a single dramatic run proves anything.

Current validation checkpoint

The current IID smoke experiment produced:

Output directory: experiments/outputs/iid_smoke
Source draws: 10000
Game rounds: 1000
Player blackjack rate: 4.5000%
Expected IID blackjack rate: 4.7337%
Edge per initial wager: -4.2000%
Edge per total wager: -3.7534%

At 1,000 game rounds, the edge number is not meaningful. It is dominated by variance. The blackjack rate is also only a smoke-test result, but at 4.50% against an expected 4.73% it is in the correct neighbourhood and, more importantly, it uses the correct denominator.
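"In the correct neighbourhood" can be given a rough size. The sketch below assumes each initial hand is an independent trial with the IID natural probability of 8/169, so the observed rate has a binomial standard error. It is a back-of-envelope check, not part of the project's output.

```python
from math import sqrt

p_expected = 8 / 169        # IID two-card natural, about 4.7337%
observed = 0.045            # smoke-test blackjack rate


def standard_error(p, n):
    """Binomial standard error of an observed rate over n independent hands."""
    return sqrt(p * (1 - p) / n)


for hands in (1_000, 1_000_000):
    se = standard_error(p_expected, hands)
    print(f"{hands:>9,} hands: standard error {se:.4%}")

# How far the smoke-test rate sits from expectation, in standard errors.
z = (observed - p_expected) / standard_error(p_expected, 1_000)
print(f"smoke-test z-score: {z:+.2f}")
```

At 1,000 hands the standard error is roughly two thirds of a percentage point, so 4.50% sits about a third of a standard error below expectation. At 1,000,000 hands the same error band shrinks by a factor of about 32, which is why scale matters for the next run.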

The purpose of this smoke test is not to certify the simulator. The purpose is to prove that the experiment framework runs end to end: source draw, game simulation, metric calculation, plot generation, and output writing.

That is a necessary checkpoint, not a final result.

The full IID validation needs to run at much larger scale.

The next IID gate

The next proper IID validation run should be large enough that obvious distributional errors have nowhere to hide.

A representative command is:

python scripts/run_iid_baseline_experiment.py \
  --source-draws 1000000 \
  --game-rounds 1000000 \
  --base-bet 10 \
  --seed 42 \
  --output-dir experiments/outputs/iid_1m_seed42

The exact seed is not important. The scale is.

The checks should include:

rank counts close to expectation
suit counts close to expectation
rank/suit counts close to expectation
ace rate close to 1 / 13
ten-value rate close to 4 / 13
expected Hi-Lo balance
target-card recurrence following the geometric overlay
rank-level recurrence following the geometric overlay
player blackjack rate close to 4.7337%
plausible win/loss/push distribution
plausible double, split, and bust rates
streak distributions without artificial push-breaking
cumulative profit path consistent with variance around the implemented strategy and rule set
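"Close to expectation" needs a yardstick. One common choice, named here as my own suggestion rather than the project's method, is a Pearson chi-square statistic on the rank counts against a uniform expectation. The sketch uses hand-made counts so the arithmetic is easy to follow. The 21.026 threshold is the standard 5% critical value for 12 degrees of freedom, which means a correct source will still exceed it about one run in twenty.

```python
def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((observed - expected) ** 2 / expected for observed in counts)


# Critical value for 12 degrees of freedom at the 5% level (13 ranks - 1).
CHI2_CRIT_DF12_5PCT = 21.026

balanced = [1000] * 13                       # perfectly uniform rank counts
skewed = [1150] + [1000] * 11 + [850]        # one rank over-dealt, one under-dealt

print(chi_square_uniform(balanced))                      # 0.0
print(chi_square_uniform(skewed))                        # 45.0
print(chi_square_uniform(skewed) > CHI2_CRIT_DF12_5PCT)  # True
```

The same function works for suit counts (3 degrees of freedom) and rank/suit counts (51 degrees of freedom), each with its own critical value.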

This is not glamorous work, but it is the work that lets the rest of the project mean something.

A broken IID baseline would be useful because it would expose a defect while the expected answer is known. A clean IID baseline is useful because it gives the project permission to move to harder comparisons.

Either way, the baseline earns its place.

Why profit is not the first signal

The temptation is to look at the One2Six smoke-test result and ask whether the player won.

That is the wrong first question.

A 10,000-round One2Six-style sample produced a positive result in the current framework. That does not mean the source is favourable. It does not mean the machine is vulnerable. It does not mean the model is correct. It means a short simulation with an approximate strategy and a configurable source produced one profit path.

At this stage, profit is a symptom, not a diagnosis.

The first signals worth caring about are structural:

Are physical-card return times different from manual shoe?
Are recent discards unavailable for a measurable interval?
Do cards from the same discard batch reappear close together?
Does the output buffer create observable latency?
Do shelf ejections create clumping or anti-clumping?
Does rank or value autocorrelation differ from IID?
Do win/loss streaks differ after controlling for baseline variance?
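The autocorrelation question is the easiest of these to sketch. The helper below computes a plain sample autocorrelation at a given lag. The name lag_autocorrelation is hypothetical, and applying it to a Hi-Lo tag stream is my assumption about one reasonable transform, not a description of the project's metric. Under IID the value should sit near zero, with noise on the order of 1 / sqrt(n).

```python
import random


def lag_autocorrelation(values, lag=1):
    """Sample autocorrelation of a numeric sequence at the given lag."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    if variance == 0 or lag >= n:
        return 0.0
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag)
    )
    return covariance / variance


# A strictly alternating stream is strongly anti-correlated at lag 1.
alternating = [1, -1] * 500
print(f"{lag_autocorrelation(alternating, 1):+.3f}")  # -0.999

# An IID Hi-Lo tag stream (+1, 0, -1 with weights 5, 3, 5) should be near zero.
rng = random.Random(3)
iid_tags = rng.choices([+1, 0, -1], weights=[5, 3, 5], k=100_000)
iid_r = lag_autocorrelation(iid_tags, 1)
print(f"{iid_r:+.4f}")
```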

Only after those questions have answers does betting strategy become serious.

If no measurable card-stream structure exists, there is nothing to exploit. If measurable structure exists but is not observable before a betting decision, there is still no practical edge. If measurable structure exists, is observable, and survives across plausible parameter settings, then the project can move from diagnostics to exploitation.

That sequence matters.

Manual shoe comes before One2Six comparison

After IID, the next comparator is not immediately the One2Six.

It is the manual shoe.

A manual shoe is not IID. It has finite-card depletion, cut-card penetration, discard trays, and reshuffle timing. It is closer to casino blackjack than IID, but still much cleaner than a continuous shuffler.

The manual-shoe source currently models:

finite physical decks
dealing without replacement
configurable deck count
configurable cut-card penetration
ordered discard tray
discards unavailable until reshuffle
no mid-round reshuffle
reshuffle at round boundary
physical card identity preservation

That comparator matters because some differences from IID are completely ordinary for a finite shoe. The One2Six should not be compared only to IID and declared strange whenever it differs. It should also be compared to the manual-shoe process.
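The listed behaviours fit in a small class. This is a rough sketch under my own assumptions, not the project's ManualShoeCardSource: the class name ManualShoeSketch, its method names, and the (deck, rank, suit) identity tuple are all hypothetical, and it assumes no round ever needs more cards than remain behind the cut card.

```python
import random


class ManualShoeSketch:
    """Hypothetical finite shoe with a cut card (not the project's class)."""

    def __init__(self, decks=6, penetration=0.75, seed=None):
        self._rng = random.Random(seed)
        # Physical identity: (deck index, rank, suit) is unique per card.
        self._all_cards = [
            (d, r, s) for d in range(decks) for r in "A23456789TJQK" for s in "SHDC"
        ]
        self._cut_index = int(len(self._all_cards) * penetration)
        self.discard_tray = []          # ordered, oldest discard first
        self._shoe = []
        self._dealt = 0
        self._reshuffle()

    def _reshuffle(self):
        self._shoe = list(self._all_cards)
        self._rng.shuffle(self._shoe)
        self.discard_tray.clear()
        self._dealt = 0

    def draw(self):
        # Dealing without replacement from the top of the shoe.
        self._dealt += 1
        return self._shoe.pop()

    def discard(self, cards):
        # Discards stay out of play until the next reshuffle.
        self.discard_tray.extend(cards)

    def end_round(self):
        # The cut card only takes effect at a round boundary.
        if self._dealt >= self._cut_index:
            self._reshuffle()
            return True
        return False


shoe = ManualShoeSketch(decks=6, penetration=0.75, seed=1)
seen = set()
rounds = 0
while True:
    hand = [shoe.draw() for _ in range(5)]   # stand-in for one round of play
    assert seen.isdisjoint(hand)             # no physical card repeats within a shoe
    seen.update(hand)
    shoe.discard(hand)
    rounds += 1
    if shoe.end_round():
        break
print(rounds, len(seen))  # 47 235
```

With six decks and 75% penetration the cut card sits after 234 cards, so five-card rounds trigger the reshuffle at the end of round 47. An IID source would never show that kind of hard boundary, which is the ordinary finite-shoe behaviour the One2Six comparison has to account for.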

The correct sequence is:

IID baseline
    -> manual-shoe baseline
        -> One2Six-style source
            -> parameter sweeps
                -> exploitability tests

Each layer removes one possible explanation.

What the baseline protects against

The baseline protects the project from five major errors.

Error	How the baseline exposes it
Bad card generation	Rank, suit, and rank/suit counts fail under IID
Wrong recurrence model	Geometric overlays fail for IID target recurrence
Bad blackjack accounting	Blackjack rate under IID lands far from the expected 4.7337%
Bad result tracking	Wager, action, profit, and edge metrics become inconsistent
False shuffler signal	One2Six anomalies are not interpreted until IID and manual-shoe behaviour is trusted

This is the difference between analysis and storytelling.

A story can begin with a suspicious machine and end with a theory. A serious project has to survive the boring checks. The fact that a result is boring does not make it optional.

In this project, boring is protection.

Current project status

At this stage, the project has:

a working casino blackjack engine
modular card sources
IID, finite shoe, manual shoe, and One2Six-style sources
physical-card identity tracking
ordered discard-rack modelling
result accounting split by initial wager, action wager, and total wager
streak tracking where pushes do not break streaks
a separate experiment framework
source-level IID recurrence metrics
game-level IID metrics
plotting for recurrence, streaks, outcomes, and profit paths
smoke-test validation showing the pipeline runs

That is meaningful progress, but it is not the end of validation.

The next gate is the large IID run. After that comes manual shoe. Only after those baselines behave as expected should the One2Six output be interpreted seriously.

That is the discipline of the project.

Final thought

The machine is the interesting object, but the baseline is the standard of proof.

It is easy to become fascinated by the One2Six mechanism. The carousel, the shelves, the feeder, the output buffer, the discard timing, the possibility of short-horizon memory - that is the part that makes the project feel alive.

But none of it matters if the baseline is wrong.

Before attacking the black box, the ruler has to be straight. IID is the first ruler. Manual shoe is the second. The One2Six model comes after that.

The question is not whether I can produce a graph that looks interesting.

The question is whether the graph means anything.

That starts here.

References and source notes

mathematical-ev/shufflemaster-simulation GitHub repository. Public code repository for the simulation framework.
experiments/ framework. Separate experiment layer for source-level and game-level metrics, recurrence diagnostics, plotting, and output artifacts.
IidRandomCardSource. IID validation source used to test rank/suit frequencies, recurrence, blackjack rate, strategy flow, settlement logic, and result accounting.
ManualShoeCardSource. Manual-shoe comparator used to separate ordinary finite-shoe effects from One2Six-style effects.
One2SixCardSource. Configurable stateful shuffler model used for later comparison after IID and manual-shoe validation.
Dimitri P. Bertsekas and John N. Tsitsiklis, Introduction to Probability. Useful background for modelling random processes, independence, and geometric waiting-time behaviour.
Bill Chen and Jerrod Ankenman, The Mathematics of Poker. Useful general foundation for separating model assumptions, expected value, variance, and exploitable decision-making.
CARD one2six User Manual, 10.02.2005. Useful background for One2Six operating procedure, discard insertion timing, front shoe, wheel, and inventory behaviour.
US Patent 6,659,460 B2, Card shuffling device. CARD / Shuffle Master patent family describing compartment-based card-handling architecture.
US Patent 6,889,979 B2, Card shuffler. Describes card forwarding into compartments, card counting, randomised compartment/drum movement, and group ejection in described embodiments.
