Investigating SLoD geometry in frozen embeddings — from linear probes to activation steering to reasoning dynamics.
Core thesis: Frozen LLM representations encode a continuous Semantic Level of Detail (SLoD) axis that is linearly decodable without retraining. Exploiting this axis — to match retrieval granularity to query abstraction, to detect output drift, and to steer generation — measurably improves scientific QA attribution and knowledge extraction quality. Continuous embedding-space dynamics in reasoning chains provide supporting behavioral evidence.
The project contributes across four layers: mechanistic (SLoD is linearly present in the embedding space), control (activation steering shifts generation), systems (SLoD-routed retrieval), and behavioral (embedding dynamics reveal reasoning quality). All 20 experiments are complete; total API cost ≈ $1.30 plus GPU compute.
```
SH0 ✓ weak labels
 ├─→ SH1 ✓ linear probe F1=0.72 ─────────┐
 │    ├─→ SH2 ✗ → SH2b ● → SH2-summ ✓    │
 │    ├─→ SH3 ● soft RAG +0.031 F1       │
 │    └─→ SH4 ● AUROC=0.676 ─────────────┘
 └─→ SH5 ✗ → SH5a ✓ → SH5c ✓ → SH5d ✓✓
```
Claim: Document structure is a sufficient free proxy for SLoD labels (macro/meso/micro).
50 papers annotated. Labels validated for SH1 training.
Datasets: S2ORC, QASPER (HuggingFace). Macro = title/abstract/intro/conclusion, Meso = section leads, Micro = methods/tables/dense entities.
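The structural weak-labeling heuristic can be sketched as a simple section-name mapper; the exact pattern lists below are illustrative assumptions, not the project's precise rules.

```python
# Sketch of the SH0 weak-labeling heuristic: derive macro/meso/micro labels
# for free from document structure. Section-name patterns are illustrative.

def weak_slod_label(section_name: str) -> str:
    """Map a paper section to a coarse SLoD label via its name."""
    name = section_name.lower()
    if any(k in name for k in ("title", "abstract", "introduction", "conclusion")):
        return "macro"   # high-level framing of the whole paper
    if any(k in name for k in ("method", "experiment", "table", "appendix")):
        return "micro"   # dense, fine-grained detail
    return "meso"        # section leads / mid-level exposition

print(weak_slod_label("Abstract"))      # macro
print(weak_slod_label("Methods"))       # micro
print(weak_slod_label("Related Work"))  # meso
```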
Claim: A linear probe on frozen SciBERT embeddings classifies macro/meso/micro above chance.
Macro-F1 = 0.72 (per-class: macro 0.82, meso 0.62, micro 0.72)
37,278 length-matched spans, balanced 12,426/class
SH1b (document-level split): grouped F1 = 0.718, delta < 0.01. Leakage negligible.
SH1c (section control): residualized probe F1 = 0.40, while a section-name-only baseline reaches 0.963; cross-section transfer fails (F1 = 0.003). But blind LLM annotation agrees with the labels (κ = 0.55), validating them.
SH1-LLM: Retrained on LLM labels → F1 = 0.754 (+0.076). Embeddings encode SLoD beyond section identity.
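The SH1 setup reduces to a multinomial linear probe over frozen embeddings. A minimal sketch, with synthetic Gaussian clusters standing in for SciBERT vectors (the real pipeline, hyperparameters, and data are of course different):

```python
import numpy as np

# Minimal sketch of a linear (softmax) probe trained on frozen embeddings.
# Synthetic 3-class Gaussians stand in for macro/meso/micro SciBERT vectors.
rng = np.random.default_rng(0)
dim, n_per_class = 32, 200
means = [rng.normal(0, 1, dim) for _ in range(3)]
X = np.vstack([m + 0.5 * rng.normal(0, 1, (n_per_class, dim)) for m in means])
y = np.repeat(np.arange(3), n_per_class)

W = np.zeros((dim, 3))
for _ in range(300):                      # plain full-batch gradient descent
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W -= 0.1 * X.T @ (p - np.eye(3)[y]) / len(X)

acc = ((X @ W).argmax(axis=1) == y).mean()
print(f"train accuracy: {acc:.2f}")
```

On the real data the probe is evaluated with macro-F1 and a document-grouped split (SH1b) to rule out leakage.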
Claim: An SLoD direction in a generative model's residual stream steers generation toward target abstraction.
Doc-span steering: d = 0.043 (Mistral-7B), d = 0.020 (Qwen2.5-14B)
SH2b diagnosis: a cross-space mismatch plus a layer-selection bug (layer chosen by |d| rather than signed d). SciBERT-based QA evaluation ceilings at d ≈ 0.121 (genre mismatch).
|d| = 0.546 but direction-inverted. 3/4 surface metrics significant, H3 fails (ROUGE drop = 0.244).
Direction inversion at α=2.0: micro steering produced macro-ward shift + quality collapse.
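The abs-vs-signed selection bug is easy to state concretely; the per-layer d values below are purely illustrative:

```python
# Sketch of the SH2b layer-selection bug: picking the steering layer by |d|
# can select a layer whose effect has the WRONG SIGN, so steering toward
# "micro" shifts generation macro-ward. Per-layer effect sizes are made up.

layer_d = {6: 0.12, 8: 0.31, 14: -0.55, 20: 0.04}  # signed Cohen's d per layer

buggy_layer = max(layer_d, key=lambda l: abs(layer_d[l]))  # |d| -> layer 14 (inverted)
fixed_layer = max(layer_d, key=lambda l: layer_d[l])       # signed d -> layer 8

print(buggy_layer, fixed_layer)  # 14 8
```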
Key result: Activation steering works when evaluation axis is in-distribution.
Cohen's d = 0.679 (layer 8, α=2.0, 851 papers)
H1 passes; H2: 3/4 surface metrics significant; H3: ROUGE-L improves by 0.011.
SH2-LLM: d → 0.701 with LLM axis (cosine sim 0.915). Robust to relabeling.
Design principle: task-domain alignment between steering target and evaluation metric is necessary.
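Mechanically, the intervention is a single vector addition in the residual stream. A toy sketch (real runs hook the chosen transformer layer, e.g. layer 8 at α = 2.0; the unit vector here is illustrative):

```python
import numpy as np

# Minimal sketch of residual-stream activation steering: add alpha * v_slod
# to the hidden states at one layer. v_slod is a toy direction here; in the
# real setup it is the probe-derived SLoD axis.

def steer(hidden: np.ndarray, v_slod: np.ndarray, alpha: float) -> np.ndarray:
    """Shift every token's residual-stream state along the SLoD axis."""
    v = v_slod / np.linalg.norm(v_slod)
    return hidden + alpha * v            # broadcasts over the sequence dimension

h = np.zeros((4, 8))                     # (seq_len, d_model) toy hidden states
v = np.ones(8)
h_steered = steer(h, v, alpha=2.0)
print(h_steered[0, 0])                   # each coordinate shifts by 2/sqrt(8)
```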
Claim: Routing retrieval to query-matching granularity improves evidence attribution.
slod_weighted_parent: soft F1 = 0.422 vs baseline 0.391 (+0.031)
13 retrieval conditions tested at k=1,3,5,10,20
Soft SLoD score-boosting plus parent expansion works; adding a BM25 hybrid and cross-encoder reranking pushes soft F1 to 0.458 (the observed ceiling).
SH3-LLM: Stable (max delta -0.007). Soft boosting absorbs probe changes.
Hard routing destroys multi-level diversity. Specter2 underperformed MiniLM. HyDE classification did not help.
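The soft-boosting idea, as opposed to hard routing, can be sketched in a few lines; the boost weight `lam` and the field names are illustrative assumptions:

```python
# Sketch of soft SLoD score-boosting for retrieval: dense similarity plus a
# bonus when a chunk's probe-predicted SLoD matches the query's SLoD. Unlike
# hard routing, mismatched chunks stay rankable, preserving level diversity.

def soft_slod_score(sim: float, chunk_slod: str, query_slod: str,
                    lam: float = 0.1) -> float:
    """Boost, never filter: the bonus only nudges the ranking."""
    return sim + (lam if chunk_slod == query_slod else 0.0)

chunks = [
    {"id": "c1", "sim": 0.80, "slod": "micro"},
    {"id": "c2", "sim": 0.74, "slod": "macro"},
]
query_slod = "macro"
ranked = sorted(chunks,
                key=lambda c: soft_slod_score(c["sim"], c["slod"], query_slod),
                reverse=True)
print([c["id"] for c in ranked])  # c2 overtakes c1 on the macro-level query
```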
Claim: Drift between expected and realized SLoD per extraction field predicts extraction correctness.
Combined AUROC = 0.676 (passes 0.65 threshold)
Drift-only: AUROC = 0.52 (near random)
SH4-LLM: Drift-only → 0.62, combined → 0.75. Biggest relabeling beneficiary.
Drift alone is not diagnostic. Surface features (word count, entity density, LLM confidence) carry most signal.
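The drift feature and its evaluation reduce to a level distance plus a rank-based AUROC; level coding and the toy data below are illustrative:

```python
# Sketch of the SH4 drift feature: per extraction field, the gap between the
# SLoD level the schema expects and the level the probe assigns to the output.
# Levels coded macro=0, meso=1, micro=2.

LEVEL = {"macro": 0, "meso": 1, "micro": 2}

def slod_drift(expected: str, realized: str) -> int:
    return abs(LEVEL[expected] - LEVEL[realized])

def auroc(scores, labels):
    """Rank-based AUROC: P(score of an incorrect field > score of a correct one)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

drifts = [slod_drift("micro", r) for r in ["micro", "macro", "micro", "meso"]]
errors = [0, 1, 0, 1]          # 1 = incorrect extraction
print(auroc(drifts, errors))   # perfect separation on this toy data
```

In the real experiment this drift score alone sits near chance (0.52); only combining it with surface features reaches 0.676.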
Claim: Lower abstraction-level jump rate in CoT steps correlates with higher answer correctness.
2000 CoT traces, 6101 steps (Claude Haiku)
ρ = 0.003, p = 0.90 (null)
Unexpected finding: jump rate ↔ attribution-F1: ρ = +0.092 (more jumping = better evidence).
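The jump-rate statistic itself is simple: the fraction of adjacent CoT steps whose probe-assigned level changes. A sketch with an illustrative trace:

```python
# Sketch of the SH5 jump-rate feature over a chain-of-thought trace whose
# steps have been assigned SLoD levels by the probe.

def jump_rate(levels: list[str]) -> float:
    if len(levels) < 2:
        return 0.0
    jumps = sum(a != b for a, b in zip(levels, levels[1:]))
    return jumps / (len(levels) - 1)

trace = ["macro", "macro", "meso", "micro", "meso"]
print(jump_rate(trace))  # 3 level changes over 4 transitions -> 0.75
```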
Claim: Specific transition patterns (not just overall jump rate) correlate with quality.
Macro→macro self-loop ↔ attr-F1: ρ = -0.197 (p = 5e-19)
20/60 feature-target pairs Bonferroni-significant
K-means reveals 2 reasoning styles: "exploratory" (attr-F1 = 0.279) vs "macro-stuck" (0.218).
SH5a-LLM: ρ = -0.197 (identical). Robust to relabeling.
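The transition-pattern features generalize the jump rate to per-pattern rates read off a level-to-level count matrix; the trace below is illustrative:

```python
from collections import Counter

# Sketch of SH5a transition features: count level-to-level moves along a
# trace, then normalize to per-pattern rates such as the macro->macro
# self-loop rate that anticorrelates with attribution-F1.

def transition_rates(levels: list[str]) -> dict[tuple[str, str], float]:
    pairs = Counter(zip(levels, levels[1:]))
    total = max(len(levels) - 1, 1)
    return {pat: n / total for pat, n in pairs.items()}

trace = ["macro", "macro", "macro", "meso", "macro"]
rates = transition_rates(trace)
print(rates[("macro", "macro")])  # 2 of 4 transitions -> 0.5
```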
Claim: Alignment between retrieved context SLoD and reasoning step SLoD predicts quality.
Weighted alignment gap ↔ attr-F1: ρ = -0.135 (p < 0.0001)
SLoD-routed retrieval → significantly lower alignment gaps (Wilcoxon p < 0.05)
SH5c-LLM: ρ → -0.231 (71% stronger). Second-biggest relabeling beneficiary.
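The weighted alignment gap can be sketched as a retrieval-score-weighted level distance between a reasoning step and its retrieved context; the weighting scheme and field names here are assumptions:

```python
# Sketch of the SH5c alignment gap: per reasoning step, the SLoD level
# distance to each retrieved chunk, weighted by retrieval score.
# Lower gap = step and evidence operate at matching abstraction levels.

LEVEL = {"macro": 0, "meso": 1, "micro": 2}

def weighted_alignment_gap(step_slod: str,
                           retrieved: list[tuple[str, float]]) -> float:
    """retrieved = [(chunk_slod, retrieval_weight), ...]."""
    total = sum(w for _, w in retrieved)
    return sum(w * abs(LEVEL[step_slod] - LEVEL[s])
               for s, w in retrieved) / total

gap = weighted_alignment_gap("micro", [("micro", 0.6), ("macro", 0.4)])
print(gap)  # 0.6*0 + 0.4*2 = 0.8
```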
Claim: Continuous SciBERT embedding-space features predict quality without discrete probe classification.
slod_axis_mean ↔ attr-F1: ρ = +0.219 (p < 1e-21)
SLoD-axis AUROC = 0.615 vs orthogonal = 0.549 (3× correlation difference)
Strongest single predictor in the entire SH5 family. SLoD axis validated: Cohen's d = 2.65.
SH5d-LLM: ρ = +0.219 (identical). Robust to 24° axis rotation.
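The probe-free feature reduces to a signed projection onto a centroid-difference axis. A sketch with synthetic clusters in place of SciBERT embeddings:

```python
import numpy as np

# Sketch of the SH5d probe-free feature: define the SLoD axis as the unit
# vector from the micro centroid to the macro centroid, then score any
# embedding by its signed projection. slod_axis_mean averages this over
# the steps of a trace.

def slod_axis(macro_embs: np.ndarray, micro_embs: np.ndarray) -> np.ndarray:
    v = macro_embs.mean(axis=0) - micro_embs.mean(axis=0)
    return v / np.linalg.norm(v)

def slod_axis_mean(step_embs: np.ndarray, axis: np.ndarray) -> float:
    return float((step_embs @ axis).mean())

rng = np.random.default_rng(1)
macro = rng.normal(0, 0.1, (50, 4)) + np.array([1.0, 0, 0, 0])
micro = rng.normal(0, 0.1, (50, 4)) - np.array([1.0, 0, 0, 0])
axis = slod_axis(macro, micro)
print(slod_axis_mean(macro, axis))  # macro spans project strongly positive
```

No classifier is involved, which is why this feature is robust to relabeling and axis rotation.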
| Direction | Description | Effort | Priority |
|---|---|---|---|
| SH6 | SLoD-conditioned summarization quality — human/model preference for SH2-summ steered summaries at different abstraction levels | 3–5 days | High |
| SH7 | Cross-task SLoD generalization — apply SH2-summ steering vector to biomedical, legal, and news domains | 3–5 days | High |
| Cross-domain | Test SH1 probe on bio/physics papers (currently CS-only); SH5d probe-free approach may generalize better | 2–3 days | High |
| SH2-QA-v2 | Build a QA-genre SLoD evaluation axis (labeled QA answer pairs), then re-run SH2b. Baseline: d = 0.121 | 3–5 days | Medium |
| SH8 | SLoD-adaptive retrieval + generation — combine SH3 soft retrieval with SH2-summ steering | 5–7 days | Medium |
| Larger benchmark | Validate SH3 on LoCoMo or S2ORC QA subsets for larger-scale effect size confirmation | 3–5 days | Medium |
| ADAM-Bench | SLoD typing on 27K papers, 7M evidence objects. Test whether SLoD-matched evidence improves claim verification | 3–5 days | Medium |
| Combined SH5 | SH5a+SH5c+SH5d features (AUROC = 0.623). Feature selection + better model may push above 0.65 | 1–2 days | Low |
| SH5b | Probe confidence calibration — does the probe's own max probability on CoT steps predict quality? | 1 day | Low |
| SH5e | Per-step embedding trajectory analysis — curvature, acceleration, attractors | 2–3 days | Low |