Investigating SLoD geometry in frozen embeddings — from linear probes to activation steering to reasoning dynamics.
Core thesis: Frozen LLM representations encode a continuous Semantic Level of Detail (SLoD) axis that is linearly decodable without retraining. Exploiting this axis — to match retrieval granularity to query abstraction, to detect output drift, and to steer generation — measurably improves scientific QA attribution and knowledge extraction quality. Continuous embedding-space dynamics in reasoning chains provide supporting behavioral evidence.
The project contributes across four layers: mechanistic (SLoD is linearly present in the embedding space), control (activation steering shifts generation), systems (SLoD-routed retrieval), and behavioral (embedding dynamics reveal reasoning quality). All 20 experiments are complete; total API cost ≈ $1.30 plus GPU compute.
```
SH0 ✓ weak labels
 ├─→ SH1 ✓ linear probe F1=0.72 ─────────┐
 │    ├─→ SH2 ✗ → SH2b ● → SH2-summ ✓    │
 │    ├─→ SH3 ● soft RAG +0.031 F1       │
 │    └─→ SH4 ● AUROC=0.676 ─────────────┘
 └─→ SH5 ✗ → SH5a ✓ → SH5c ✓ → SH5d ✓✓
```
Claim: Document structure is a sufficient free proxy for SLoD labels (macro/meso/micro).
50 papers annotated. Labels validated for SH1 training.
Datasets: S2ORC, QASPER (HuggingFace). Macro = title/abstract/intro/conclusion, Meso = section leads, Micro = methods/tables/dense entities.
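The structural weak-labeling heuristic can be sketched as a simple section-name mapper; the exact pattern lists below are illustrative assumptions, not the project's precise rules.

```python
# Sketch of the SH0 weak-labeling heuristic: derive macro/meso/micro labels
# for free from document structure. Section-name patterns are illustrative.

def weak_slod_label(section_name: str) -> str:
    """Map a paper section to a coarse SLoD label via its name."""
    name = section_name.lower()
    if any(k in name for k in ("title", "abstract", "introduction", "conclusion")):
        return "macro"   # high-level framing of the whole paper
    if any(k in name for k in ("method", "experiment", "table", "appendix")):
        return "micro"   # dense, fine-grained detail
    return "meso"        # section leads / mid-level exposition

print(weak_slod_label("Abstract"))      # macro
print(weak_slod_label("Methods"))       # micro
print(weak_slod_label("Related Work"))  # meso
```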
Claim: A linear probe on frozen SciBERT embeddings classifies macro/meso/micro above chance.
Macro-F1 = 0.72 (per-class: macro 0.82, meso 0.62, micro 0.72)
37,278 length-matched spans, balanced 12,426/class
SH1b (document-level split): grouped F1 = 0.718, delta < 0.01. Leakage negligible.
SH1c (section control): residualized probe F1 = 0.40, while a section-name-only baseline reaches 0.963; cross-section transfer fails (F1 = 0.003). But blind LLM annotation agrees with the labels (κ = 0.55), validating them.
SH1-LLM: Retrained on LLM labels → F1 = 0.754 (+0.076). Embeddings encode SLoD beyond section identity.
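The SH1 setup reduces to a multinomial linear probe over frozen embeddings. A minimal sketch, with synthetic Gaussian clusters standing in for SciBERT vectors (the real pipeline, hyperparameters, and data are of course different):

```python
import numpy as np

# Minimal sketch of a linear (softmax) probe trained on frozen embeddings.
# Synthetic 3-class Gaussians stand in for macro/meso/micro SciBERT vectors.
rng = np.random.default_rng(0)
dim, n_per_class = 32, 200
means = [rng.normal(0, 1, dim) for _ in range(3)]
X = np.vstack([m + 0.5 * rng.normal(0, 1, (n_per_class, dim)) for m in means])
y = np.repeat(np.arange(3), n_per_class)

W = np.zeros((dim, 3))
for _ in range(300):                      # plain full-batch gradient descent
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.1 * X.T @ (p - np.eye(3)[y]) / len(X)

acc = ((X @ W).argmax(axis=1) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

On the real data the probe is evaluated with macro-F1 and a document-grouped split (SH1b) to rule out leakage.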
Claim: An SLoD direction in a generative model's residual stream steers generation toward target abstraction.
Doc-span steering: d = 0.043 (Mistral-7B), d = 0.020 (Qwen2.5-14B)
SH2b diagnosis: a cross-space mismatch plus a layer-selection bug (layer chosen by |d| rather than signed d). SciBERT-based QA evaluation ceilings at d ≈ 0.121 (genre mismatch).
|d| = 0.546 but direction-inverted. 3/4 surface metrics significant, H3 fails (ROUGE drop = 0.244).
Direction inversion at α=2.0: micro steering produced macro-ward shift + quality collapse.
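The abs-vs-signed selection bug is easy to state concretely; the per-layer d values below are purely illustrative:

```python
# Sketch of the SH2b layer-selection bug: picking the steering layer by |d|
# can select a layer whose effect has the WRONG SIGN, so steering toward
# "micro" shifts generation macro-ward. Per-layer effect sizes are made up.

layer_d = {6: 0.12, 8: 0.31, 14: -0.55, 20: 0.04}  # signed Cohen's d per layer

buggy_layer = max(layer_d, key=lambda l: abs(layer_d[l]))  # |d| -> layer 14 (inverted)
fixed_layer = max(layer_d, key=lambda l: layer_d[l])       # signed d -> layer 8

print(buggy_layer, fixed_layer)  # 14 8
```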
Key result: Activation steering works when evaluation axis is in-distribution.
Cohen's d = 0.679 (layer 8, α=2.0, 851 papers)
H1 passes; H2: 3/4 surface metrics significant; H3: ROUGE-L improves by 0.011.
SH2-LLM: d → 0.701 with LLM axis (cosine sim 0.915). Robust to relabeling.
Design principle: task-domain alignment between steering target and evaluation metric is necessary.
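Mechanically, the intervention is a single vector addition in the residual stream. A toy sketch (real runs hook the chosen transformer layer, e.g. layer 8 at α = 2.0; the unit vector here is illustrative):

```python
import numpy as np

# Minimal sketch of residual-stream activation steering: add alpha * v_slod
# to the hidden states at one layer. v_slod is a toy direction here; in the
# real setup it is the probe-derived SLoD axis.

def steer(hidden: np.ndarray, v_slod: np.ndarray, alpha: float) -> np.ndarray:
    """Shift every token's residual-stream state along the SLoD axis."""
    v = v_slod / np.linalg.norm(v_slod)
    return hidden + alpha * v            # broadcasts over the sequence dimension

h = np.zeros((4, 8))                     # (seq_len, d_model) toy hidden states
v = np.ones(8)
h_steered = steer(h, v, alpha=2.0)
print(h_steered[0, 0])                   # each coordinate shifts by 2/sqrt(8)
```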
Claim: Routing retrieval to query-matching granularity improves evidence attribution.
slod_weighted_parent: soft F1 = 0.422 vs baseline 0.391 (+0.031)
13 retrieval conditions tested at k=1,3,5,10,20
Soft SLoD score-boosting plus parent expansion works; adding a BM25 hybrid and cross-encoder reranking pushes soft F1 to 0.458 (the observed ceiling).
SH3-LLM: Stable (max delta -0.007). Soft boosting absorbs probe changes.
Hard routing destroys multi-level diversity. Specter2 underperformed MiniLM. HyDE classification did not help.
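The soft-boosting idea, as opposed to hard routing, can be sketched in a few lines; the boost weight `lam` and the field names are illustrative assumptions:

```python
# Sketch of soft SLoD score-boosting for retrieval: dense similarity plus a
# bonus when a chunk's probe-predicted SLoD matches the query's SLoD. Unlike
# hard routing, mismatched chunks stay rankable, preserving level diversity.

def soft_slod_score(sim: float, chunk_slod: str, query_slod: str,
                    lam: float = 0.1) -> float:
    """Boost, never filter: the bonus only nudges the ranking."""
    return sim + (lam if chunk_slod == query_slod else 0.0)

chunks = [
    {"id": "c1", "sim": 0.80, "slod": "micro"},
    {"id": "c2", "sim": 0.74, "slod": "macro"},
]
query_slod = "macro"
ranked = sorted(chunks,
                key=lambda c: soft_slod_score(c["sim"], c["slod"], query_slod),
                reverse=True)
print([c["id"] for c in ranked])  # c2 overtakes c1 on the macro-level query
```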
Claim: Drift between expected and realized SLoD per extraction field predicts extraction correctness.
Combined AUROC = 0.676 (passes 0.65 threshold)
Drift-only: AUROC = 0.52 (near random)
SH4-LLM: Drift-only → 0.62, combined → 0.75. Biggest relabeling beneficiary.
Drift alone is not diagnostic. Surface features (word count, entity density, LLM confidence) carry most signal.
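The drift feature and its evaluation reduce to a level distance plus a rank-based AUROC; level coding and the toy data below are illustrative:

```python
# Sketch of the SH4 drift feature: per extraction field, the gap between the
# SLoD level the schema expects and the level the probe assigns to the output.
# Levels coded macro=0, meso=1, micro=2.

LEVEL = {"macro": 0, "meso": 1, "micro": 2}

def slod_drift(expected: str, realized: str) -> int:
    return abs(LEVEL[expected] - LEVEL[realized])

def auroc(scores, labels):
    """Rank-based AUROC: P(score of an incorrect field > score of a correct one)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

drifts = [slod_drift("micro", r) for r in ["micro", "macro", "micro", "meso"]]
errors = [0, 1, 0, 1]          # 1 = incorrect extraction
print(auroc(drifts, errors))   # perfect separation on this toy data
```

In the real experiment this drift score alone sits near chance (0.52); only combining it with surface features reaches 0.676.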
Claim: Lower abstraction-level jump rate in CoT steps correlates with higher answer correctness.
2000 CoT traces, 6101 steps (Claude Haiku)
ρ = 0.003, p = 0.90 (null)
Unexpected finding: jump rate ↔ attribution-F1: ρ = +0.092 (more jumping = better evidence).
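The jump-rate statistic itself is simple: the fraction of adjacent CoT steps whose probe-assigned level changes. A sketch with an illustrative trace:

```python
# Sketch of the SH5 jump-rate feature over a chain-of-thought trace whose
# steps have been assigned SLoD levels by the probe.

def jump_rate(levels: list[str]) -> float:
    if len(levels) < 2:
        return 0.0
    jumps = sum(a != b for a, b in zip(levels, levels[1:]))
    return jumps / (len(levels) - 1)

trace = ["macro", "macro", "meso", "micro", "meso"]
print(jump_rate(trace))  # 3 level changes over 4 transitions -> 0.75
```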
Claim: Specific transition patterns (not just overall jump rate) correlate with quality.
Macro→macro self-loop ↔ attr-F1: ρ = -0.197 (p = 5e-19)
20/60 feature-target pairs Bonferroni-significant
K-means reveals 2 reasoning styles: "exploratory" (attr-F1 = 0.279) vs "macro-stuck" (0.218).
SH5a-LLM: ρ = -0.197 (identical). Robust to relabeling.
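The transition-pattern features generalize the jump rate to per-pattern rates read off a level-to-level count matrix; the trace below is illustrative:

```python
from collections import Counter

# Sketch of SH5a transition features: count level-to-level moves along a
# trace, then normalize to per-pattern rates such as the macro->macro
# self-loop rate that anticorrelates with attribution-F1.

def transition_rates(levels: list[str]) -> dict[tuple[str, str], float]:
    pairs = Counter(zip(levels, levels[1:]))
    total = max(len(levels) - 1, 1)
    return {pat: n / total for pat, n in pairs.items()}

trace = ["macro", "macro", "macro", "meso", "macro"]
rates = transition_rates(trace)
print(rates[("macro", "macro")])  # 2 of 4 transitions -> 0.5
```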
Claim: Alignment between retrieved context SLoD and reasoning step SLoD predicts quality.
Weighted alignment gap ↔ attr-F1: ρ = -0.135 (p < 0.0001)
SLoD-routed retrieval → significantly lower alignment gaps (Wilcoxon p < 0.05)
SH5c-LLM: ρ → -0.231 (71% stronger). Second-biggest relabeling beneficiary.
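The weighted alignment gap can be sketched as a retrieval-score-weighted level distance between a reasoning step and its retrieved context; the weighting scheme and field names here are assumptions:

```python
# Sketch of the SH5c alignment gap: per reasoning step, the SLoD level
# distance to each retrieved chunk, weighted by retrieval score.
# Lower gap = step and evidence operate at matching abstraction levels.

LEVEL = {"macro": 0, "meso": 1, "micro": 2}

def weighted_alignment_gap(step_slod: str,
                           retrieved: list[tuple[str, float]]) -> float:
    """retrieved = [(chunk_slod, retrieval_weight), ...]."""
    total = sum(w for _, w in retrieved)
    return sum(w * abs(LEVEL[step_slod] - LEVEL[s])
               for s, w in retrieved) / total

gap = weighted_alignment_gap("micro", [("micro", 0.6), ("macro", 0.4)])
print(gap)  # 0.6*0 + 0.4*2 = 0.8
```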
Claim: Continuous SciBERT embedding-space features predict quality without discrete probe classification.
slod_axis_mean ↔ attr-F1: ρ = +0.219 (p < 1e-21)
SLoD-axis AUROC = 0.615 vs orthogonal = 0.549 (3× correlation difference)
Strongest single predictor in the entire SH5 family. SLoD axis validated: Cohen's d = 2.65.
SH5d-LLM: ρ = +0.219 (identical). Robust to 24° axis rotation.
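The probe-free feature reduces to a signed projection onto a centroid-difference axis. A sketch with synthetic clusters in place of SciBERT embeddings:

```python
import numpy as np

# Sketch of the SH5d probe-free feature: define the SLoD axis as the unit
# vector from the micro centroid to the macro centroid, then score any
# embedding by its signed projection. slod_axis_mean averages this over
# the steps of a trace.

def slod_axis(macro_embs: np.ndarray, micro_embs: np.ndarray) -> np.ndarray:
    v = macro_embs.mean(axis=0) - micro_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def slod_axis_mean(step_embs: np.ndarray, axis: np.ndarray) -> float:
    return float((step_embs @ axis).mean())

rng = np.random.default_rng(1)
macro = rng.normal(0, 0.1, (50, 4)) + np.array([1.0, 0, 0, 0])
micro = rng.normal(0, 0.1, (50, 4)) - np.array([1.0, 0, 0, 0])
axis = slod_axis(macro, micro)
print(slod_axis_mean(macro, axis))  # macro spans project strongly positive
```

No classifier is involved, which is why this feature is robust to relabeling and axis rotation.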
| Direction | Description | Effort | Priority |
|---|---|---|---|
| SH6 | SLoD-conditioned summarization quality — human/model preference for SH2-summ steered summaries at different abstraction levels | 3–5 days | High |
| SH7 | Cross-task SLoD generalization — apply SH2-summ steering vector to biomedical, legal, and news domains | 3–5 days | High |
| Cross-domain | Test SH1 probe on bio/physics papers (currently CS-only); SH5d probe-free approach may generalize better | 2–3 days | High |
| SH2-QA-v2 | Build a QA-genre SLoD evaluation axis (labeled QA answer pairs), then re-run SH2b. Baseline: d = 0.121 | 3–5 days | Medium |
| SH8 | SLoD-adaptive retrieval + generation — combine SH3 soft retrieval with SH2-summ steering | 5–7 days | Medium |
| Larger benchmark | Validate SH3 on LoCoMo or S2ORC QA subsets for larger-scale effect size confirmation | 3–5 days | Medium |
| ADAM-Bench | SLoD typing on 27K papers, 7M evidence objects. Test whether SLoD-matched evidence improves claim verification | 3–5 days | Medium |
| Combined SH5 | SH5a+SH5c+SH5d features (AUROC = 0.623). Feature selection + better model may push above 0.65 | 1–2 days | Low |
| SH5b | Probe confidence calibration — does the probe's own max probability on CoT steps predict quality? | 1 day | Low |
| SH5e | Per-step embedding trajectory analysis — curvature, acceleration, attractors | 2–3 days | Low |