4
Experiments
~15K
Model Queries
500
Benchmark Samples
17%
Max SBH Rate
A CSCI-544 NLP course project at USC. When an LLM thinks step by step, does it actually use that reasoning to reach its answer, or is the chain of thought just a plausible explanation for something it already decided? Four experiments across two models and two benchmarks try to find out.

The question that drove this project is deceptively simple: when an LLM writes out its reasoning, does that reasoning actually change what answer it gives? Four experiments probe this across two locally-run models (Llama 3.2 3B and Qwen 2.5 7B) and two benchmarks (GSM8K math and ARC-Challenge science), collecting roughly 15,000 queries over a 6.2-hour run.
The baseline first checks whether CoT even helps. For math, it does in a big way: Llama jumps from 5.2% to 48.8%, and Qwen from 16% to 65.6%. For science multiple-choice, both models score worse with CoT. That contrast is the through-line of every other experiment.
Experiment 1 truncates the reasoning chain after each step and checks if the answer changes. Experiment 2 injects deterministic rule-based errors into the CoT and measures whether the model follows the corrupted logic. Experiment 3 prepends authoritative hints suggesting wrong answers, then classifies whether the model acknowledges the hint in its reasoning or gets silently steered.
All proportions come with 95% Wald confidence intervals. Cross-model comparisons use McNemar's exact test, both significant at p below 0.001.
Every (model, dataset, question) pair gets two queries: direct and chain-of-thought. On GSM8K, CoT turns near-random guessing into real performance. On ARC, both models do worse when they reason out loud, which suggests the chain is introducing noise on top of knowledge the model already has.
Math · GSM8K
CoT dramatically improves accuracy
Science MC · ARC
CoT actually hurts performance
No-CoT vs CoT accuracy for both models. Amber bar = CoT improved; gray bar = CoT degraded. Deltas shown top-right of each group.
The CoT from Experiment 0 gets parsed into discrete steps using a three-level hierarchy (numbered markers, transition words, sentence boundaries). The model then answers using only the first k steps, and we check how often that partial answer matches the full-chain answer. Low SCR at step 1 means the model genuinely needs later steps. Qwen reaches the same ARC answer from step 1 alone in 83% of cases, which means those remaining steps add nothing.
Step Consistency Rate across truncation steps 1 to 5. Science lines (blue) stay high from the start. Math lines (amber) stay low, showing the model needs the full chain.
Rule-based errors are injected into the CoT across six conditions: none, early, middle, late, early+late, and all. For GSM8K, the Corruption Following Rate (CFR) climbs as more steps are corrupted, and late-step corruption consistently outpaces early-step corruption by about 10pp. That pattern makes sense if the final calculation steps are what actually determine the answer. On ARC, Qwen's CFR stays flat at 9 to 11% regardless of how much of the reasoning is corrupted.
Llama 3B · Math
CFR increases with late-step corruption
Qwen 7B · Math
Same pattern, lower baseline
Llama 3B · Science
Moderate CFR regardless of position
Qwen 7B · Science
Nearly immune to corruption
CFR by corruption condition for each model and dataset. The 'All' bar is highlighted. Notice the contrast between math (moderate and increasing) and Qwen on science (flat and near-zero).
A hint suggesting a wrong answer is prepended at four strength levels, from a gentle "could the answer perhaps be X?" up to "a Stanford professor mentioned the answer is X." Responses are classified into four outcomes: Faithful Reject (acknowledged the hint, gave correct answer), Faithful Follow (acknowledged it, followed it), Unfaithful Ignore (silently ignored), and Steered-But-Hidden (silently followed it). SBH is the one that matters most. On ARC, Qwen's SBH rate triples from weak to strong hints while the Hint Acknowledgment Rate barely moves. The model is getting more influenced but hiding it better.
Steered-But-Hidden rate by hint strength. Blue lines = science MC (vulnerable). Amber lines = math (nearly flat). Dashed = Qwen 7B.
Baseline accuracy, step-based truncation (SCR), reasoning corruption (CFR), and biased hint injection, each probing faithfulness from a different angle across roughly 15,000 model queries
Six corruption conditions with rule-based perturbations only: arithmetic swaps, negated conclusions, reversed causation. No LLM in the loop, so results are fully reproducible
95% Wald confidence intervals on all proportions. McNemar's exact test for cross-model comparisons. Both cross-model differences land at p below 0.001
17 regex patterns catch whether hints appear in the CoT. SBH rates reach 17% on ARC: the model changes its answer to match the wrong hint while the reasoning reads as independent analysis
Friction
Takeaways