USUL

Created: May 11, 2026 at 8:05 AM

ACADEMIC RESEARCH - 2026-05-11

Executive Summary

RLVR/GRPO fixes for sparse/binary rewards: A set of May 2026 papers diagnose concrete optimization failures in RLVR/GRPO under binary/verifiable rewards (e.g., gradient starvation, unstable credit assignment) and propose algorithmic fixes that can improve stability and long-horizon agent training.
Automated test-time scaling controllers + cheaper aggregation: New work pushes test-time scaling from hand-tuned heuristics toward automated controller discovery and lower-cost aggregation/weighting, making multi-sample reasoning more production-viable under latency/cost constraints.
MoE architecture + systems to reduce all-to-all pain: Several MoE papers propose shared expert pools, modular/subset expert activation, federated expert clusters, and HPC-aware training optimizations to reduce communication overhead and improve deployability.
Agent safety/security evals for tool-mediated, multi-step risk: Benchmarks and pipelines target real deployment failure modes—risky moments in phone agents, sequenced coding misuse, ambiguous retail policy adherence, and binary patch auditing—likely to become gating evals for enterprise agent rollout.
LLM-judge reliability: invariance, constraint-level judging, and adaptive boundaries: Papers formalize desiderata and methods for more reliable LLM-as-judge evaluation (policy invariance, constraint-level scoring, benchmarkless contracts, and dynamic boundary evaluation), directly impacting RLHF/RLVR and safety auditing workflows.

Top Priority Items

1. RLVR/GRPO and post-training optimization: fixing binary-reward pathologies and improving credit assignment

Summary: These papers analyze why RLVR/GRPO-style post-training can become unstable or sample-inefficient under sparse/binary rewards common in verifiable tasks (math, code, tool-use success). They surface concrete degeneracies (e.g., vanishing/degenerate advantages when group outcomes collapse) and propose fixes spanning advantage shaping, pass-rate control, and more structured credit assignment. Collectively, they aim to make RLVR a more reliable lever for turning base models into robust long-horizon agents.

Details:

What the papers do (methodology)
- Each work studies RL with verifiable rewards (often binary success/fail) in the GRPO/RLVR family, typically by (a) theoretical/algorithmic analysis of the objective/gradient estimator under group sampling, and (b) empirical validation on verifiable benchmarks (commonly math/code-style tasks) to show stability or sample-efficiency changes under controlled ablations. See the individual arXiv sources for the specific experimental suites and algorithm variants: http://arxiv.org/abs/2605.07689v1, http://arxiv.org/abs/2605.06650v1, http://arxiv.org/abs/2605.05112v1, http://arxiv.org/abs/2605.07660v1, http://arxiv.org/abs/2605.06642v1, http://arxiv.org/abs/2605.06200v1.

Key technical contributions (as a cluster)
- Binary-reward pathology diagnosis: The papers characterize failure modes where learning signal collapses when most samples in a group share the same outcome (all pass or all fail), yielding low-variance but near-zero effective gradients (“gradient starvation”) and brittle learning dynamics under small group sizes or improving policies. (http://arxiv.org/abs/2605.07689v1, http://arxiv.org/abs/2605.06650v1)
- Advantage/reward shaping to preserve learning signal: Proposed modifications re-scale or reshape advantages/rewards to maintain informative gradients even when raw binary outcomes saturate, improving stability and enabling smaller groups (lower sampling cost) without losing training signal. (http://arxiv.org/abs/2605.06650v1, http://arxiv.org/abs/2605.07660v1)
- Pass-rate / difficulty targeting: Several approaches explicitly control or target pass-rate bands (or otherwise adapt sampling/weighting by difficulty) so training remains in a regime with informative feedback, rather than drifting into near-always-pass/near-always-fail where binary rewards stop differentiating behaviors. (http://arxiv.org/abs/2605.05112v1, http://arxiv.org/abs/2605.06642v1)
- Structured credit assignment for agentic rollouts: A recurring theme is moving from “one scalar reward at the end” toward more structured signals aligned to the rollout (e.g., turn/token/tool-call structure; role separation such as anchor/explorer behaviors) to reduce miscrediting in long-horizon trajectories. (http://arxiv.org/abs/2605.06200v1, http://arxiv.org/abs/2605.07660v1)

Why this matters for agent systems (applications)
- Tool-using agents are dominated by sparse, delayed, and often binary outcomes (task success, tests passing, API call succeeded). These papers’ fixes directly translate into more stable post-training for agents on software automation, web tasks, and multi-step tool workflows where naive RLVR can stall or become unstable. (http://arxiv.org/abs/2605.07689v1, http://arxiv.org/abs/2605.06200v1)
- Better credit assignment suggests concrete instrumentation changes: to exploit structured rewards, teams may need to log per-step/tool-call events, intermediate verifiers, and hierarchical labels (plan vs execution), enabling reward decomposition and more reliable learning. (http://arxiv.org/abs/2605.06200v1)

Practical integration notes for an agentic infrastructure team
- Training pipeline: Add explicit monitoring for group outcome collapse (all-pass/all-fail batches) and track effective advantage variance; use this as an early warning for gradient starvation and as a trigger for adaptive shaping or difficulty sampling. (http://arxiv.org/abs/2605.07689v1)
- Data/rollout design: For long-horizon tasks, introduce intermediate verifiers (unit tests, schema checks, tool-call success predicates) to create denser, structured reward signals aligned to agent steps. (http://arxiv.org/abs/2605.06200v1)
- Evaluation: Report learning curves vs group size and vs pass-rate regime; improvements that only show up at large groups may not be cost-effective for production-scale RLVR. (http://arxiv.org/abs/2605.06650v1)
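
The group-collapse monitoring described above can be sketched as follows. This is a minimal illustration, assuming the commonly used group-normalized advantage (reward minus group mean, divided by group standard deviation); the function names, the `eps` constant, and the example groups are hypothetical and are not taken from the cited papers.

```python
from statistics import mean, pstdev, pvariance

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps). Assumed form."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def is_collapsed(rewards):
    """True when every sample in the group shares the same outcome."""
    return len(set(rewards)) <= 1

def collapse_stats(groups):
    """Fraction of collapsed groups and mean within-group reward variance."""
    collapsed = sum(is_collapsed(g) for g in groups)
    return {
        "collapsed_fraction": collapsed / len(groups),
        "mean_reward_variance": mean(pvariance(g) for g in groups),
    }

# Hypothetical batch: one all-pass group, one all-fail group, one mixed group.
groups = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 1.0],
]
stats = collapse_stats(groups)
all_pass_advantages = group_advantages(groups[0])  # every entry is 0.0
mixed_advantages = group_advantages(groups[2])     # nonzero, sums to ~0
```

Under this form, an all-pass or all-fail group produces zero advantages for every sample and so contributes no gradient; the collapsed fraction is therefore a direct proxy for rollouts that carry no training signal.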

Importance: RLVR is increasingly the main mechanism to convert base-model capability into reliable agent behavior on verifiable tasks; small algorithmic improvements can yield large gains at fixed compute. For an agent platform, these results are strategically actionable: they imply specific training-time diagnostics (detecting saturation/gradient starvation), rollout instrumentation (step/tool-call signals), and curriculum/pass-rate control that can materially improve long-horizon tool-use reliability and reduce RL iteration cost. (http://arxiv.org/abs/2605.07689v1, http://arxiv.org/abs/2605.06200v1)

2. Automated discovery and efficiency improvements for test-time scaling / aggregation in LLM inference

Summary: These papers improve test-time scaling (multi-sample reasoning, branching, and aggregation) along two axes: automating the search for effective inference policies and reducing the cost of selecting/weighting candidate solutions. They suggest that controller synthesis and smarter aggregation can approach self-consistency gains with much lower token overhead. This makes test-time scaling more feasible for production agents constrained by latency and cost.

Details:

What the papers do (methodology)
- The cluster explores algorithmic inference-time policies that generate multiple candidates and then select/aggregate them, either by (a) automatically searching over branching/probing/pruning strategies, or (b) introducing cheaper proxies for candidate reliability than full extra sampling/judging. Empirical evaluations typically compare accuracy vs token/latency cost across reasoning benchmarks. (http://arxiv.org/abs/2605.08083v1, http://arxiv.org/abs/2605.08070v1, http://arxiv.org/abs/2605.07654v1, http://arxiv.org/abs/2605.06219v1)

Key technical contributions
- Automated controller discovery for TTS: Rather than fixed heuristics (e.g., “sample N then majority vote”), these works search/learn policies that decide when to branch, how many samples to draw, when to stop early, and how to allocate compute across steps, optimizing for an accuracy/cost objective. (http://arxiv.org/abs/2605.08083v1)
- Cheaper aggregation/weighting: Proposals such as prefix-consistency-style weighting aim to approximate the benefits of self-consistency while reducing the number of full-length samples needed, lowering marginal token cost. (http://arxiv.org/abs/2605.08070v1)
- Interaction-aware aggregation: Moving beyond independent-vote assumptions, some approaches model dependencies/correlation among candidates (e.g., energy/Ising-style formulations) to improve robustness when samples are not independent (common with low-temperature decoding or shared prefixes). (http://arxiv.org/abs/2605.06219v1)
- Efficiency-focused TTS variants: Additional work targets reducing overhead in multi-sample pipelines by reusing partial computation or structuring sampling/selection to avoid generating full candidates that will be discarded. (http://arxiv.org/abs/2605.07654v1)

Applications to agent systems
- Agent orchestration can treat TTS as a first-class “reasoning budget allocator”: for expensive tool calls (web browsing, code execution), a controller can decide whether to spend compute on more internal deliberation vs acting, and can stop early when confidence is high. (http://arxiv.org/abs/2605.08083v1)
- Correlated-candidate handling matters in agents because many candidates share the same retrieved context/tool outputs; interaction-aware aggregation can reduce systematic failure modes where all samples repeat the same wrong assumption. (http://arxiv.org/abs/2605.06219v1)

Practical integration notes
- Implement a pluggable TTS policy interface: expose knobs for branching factor, early-exit thresholds, and selection rules; then run offline policy search against your own task distribution and latency SLOs. (http://arxiv.org/abs/2605.08083v1)
- Add candidate-correlation diagnostics: track shared-prefix lengths and agreement structure to decide when majority vote is unreliable and when interaction-aware selection is worth the overhead. (http://arxiv.org/abs/2605.06219v1)
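
A pluggable policy interface of the kind described above might look like the sketch below. It is a hand-written baseline (majority vote with an agreement-based early exit), not a controller from the cited papers; `TTSPolicy`, `run_policy`, the knob names, and the canned sampler are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TTSPolicy:
    max_samples: int = 8          # upper bound on candidates drawn
    min_samples: int = 3          # draw at least this many before exiting early
    exit_agreement: float = 0.75  # stop once the top answer holds this vote share

def run_policy(sample_fn: Callable[[], str], policy: TTSPolicy) -> Tuple[str, int, float]:
    """Draw candidates until the vote share of the leading answer is high enough.

    Returns (selected answer, samples used, final vote share).
    """
    answers: List[str] = []
    for _ in range(policy.max_samples):
        answers.append(sample_fn())
        _, count = Counter(answers).most_common(1)[0]
        share = count / len(answers)
        if len(answers) >= policy.min_samples and share >= policy.exit_agreement:
            break
    top, count = Counter(answers).most_common(1)[0]
    return top, len(answers), count / len(answers)

# Hypothetical sampler that replays canned answers instead of calling a model.
canned = iter(["a", "b", "a", "a", "a", "a", "a", "a"])
answer, used, share = run_policy(lambda: next(canned), TTSPolicy())
```

Because the knobs live in a plain dataclass, an offline search can sweep them against logged tasks and a latency budget without touching the sampling code.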

Importance: Test-time scaling is one of the highest-leverage ways to trade inference compute for accuracy without retraining, but historically has been too expensive/heuristic-driven for many production settings. Automated controller discovery plus cheaper, correlation-aware aggregation can turn TTS into a tunable, model-agnostic component of an agent runtime—directly affecting cost, latency, and reliability tradeoffs in tool-using systems. (http://arxiv.org/abs/2605.08083v1, http://arxiv.org/abs/2605.08070v1)

3. MoE architecture and systems advances: shared pools, modular experts, federated clusters, and HPC training optimization

Summary: This cluster advances Mixture-of-Experts along both architecture and systems dimensions, targeting the practical bottlenecks that often erase MoE’s theoretical efficiency gains. The papers propose shared/global expert pools, modular or subset expert activation for deployability, federated expert clustering, and platform-aware training optimizations to reduce communication overhead (notably all-to-all). The net effect is to make MoE training and serving more realistic on common multi-GPU/multi-node topologies.

Details:

What the papers do (methodology)
- The works combine architectural proposals (routing/expert organization) with systems evaluation (throughput, communication volume, scaling efficiency) on multi-GPU/multi-node setups, typically comparing against baseline MoE implementations to quantify all-to-all and placement overheads. (http://arxiv.org/abs/2605.06665v1, http://arxiv.org/abs/2605.06663v1, http://arxiv.org/abs/2605.06206v1, http://arxiv.org/abs/2605.05049v1)

Key technical contributions
- Shared/global expert pools: Architectures that allow experts to be reused across layers or modules (or otherwise pooled) aim to improve parameter efficiency and specialization without multiplying communication costs linearly with depth. (http://arxiv.org/abs/2605.06665v1)
- Modular/subset expert activation: Designs that permit activating only a subset of experts (or modules) enable tiered serving (cheap subset vs full) and can reduce memory/communication requirements for constrained deployments. (http://arxiv.org/abs/2605.06663v1)
- Federated expert clusters: Organizing experts into clusters (potentially aligned with hardware locality) reduces cross-node traffic and can mitigate all-to-all patterns that dominate runtime at scale. (http://arxiv.org/abs/2605.06206v1)
- Platform/HPC-aware optimization: Training-stack optimizations and resource modeling target real bottlenecks (network topology, collective ops, kernel efficiency), narrowing the gap between paper MoE FLOP savings and realized wall-clock savings. (http://arxiv.org/abs/2605.05049v1)

Applications to agent systems
- Specialization for tool use: MoE is a natural fit for agent stacks where different “skills” (code, planning, retrieval, dialogue) benefit from specialization; modular/subset activation can route cheap requests to a small expert set and reserve full MoE capacity for hard tasks. (http://arxiv.org/abs/2605.06663v1)
- Serving-tier design: Expert subsets can map to product tiers (latency-sensitive vs accuracy-sensitive) while keeping a single model family, simplifying orchestration and evaluation. (http://arxiv.org/abs/2605.06663v1)

Practical integration notes
- If you operate multi-node training: prioritize approaches that explicitly reduce all-to-all or align routing with hardware locality; measure not just tokens/sec but also communication time fraction and tail latency. (http://arxiv.org/abs/2605.06206v1)
- If you operate multi-tenant serving: consider exposing “expert budget” as an inference-time control (subset size/top-k experts) that your agent runtime can tune per request. (http://arxiv.org/abs/2605.06663v1)
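
The “expert budget” control mentioned above can be illustrated with ordinary top-k gating, where the budget is simply `k`. This is a generic sketch, assuming softmax router scores renormalized over the selected experts; the cited papers may select or weight experts differently, and `route_top_k` and the example logits are hypothetical.

```python
import math
from typing import Dict, List

def route_top_k(router_logits: List[float], k: int) -> Dict[int, float]:
    """Select the k highest-scoring experts and renormalize their softmax weights.

    Returns a mapping from expert index to mixing weight (weights sum to 1).
    """
    k = max(1, min(k, len(router_logits)))
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Hypothetical router scores for one token over four experts.
logits = [2.0, 0.5, 1.0, -1.0]
cheap_tier = route_top_k(logits, k=1)  # latency-sensitive request
full_tier = route_top_k(logits, k=3)   # accuracy-sensitive request
```

Exposing `k` per request lets a runtime trade quality for memory and communication without switching model families, which is the serving-tier idea described above.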

Importance: MoE is one of the few scaling paths that can increase capacity without proportional FLOPs, but communication and deployability often dominate real costs. These papers are strategically relevant for any agent infrastructure that expects to serve large models under tight latency/cost constraints: they suggest concrete architectural knobs (subset activation) and systems strategies (locality-aware clustering, platform-aware optimization) that can materially improve throughput and enable specialization without untenable all-to-all overhead. (http://arxiv.org/abs/2605.06206v1, http://arxiv.org/abs/2605.06663v1)

4. Agent safety and security benchmarks/pipelines: phone risky moments, sequenced coding misuse, retail policy ambiguity, and binary patch auditing

Summary: These papers introduce agent-focused safety/security evaluations that move beyond single-turn refusal and test multi-step, tool-mediated failure modes seen in real deployments. They cover risky decision points in phone-agent interactions, sequenced coding misuse chains, ambiguous policy adherence in retail-like settings, and pipelines for auditing binary patches. For agent builders, they provide more realistic gating evals and highlight where current systems can fail despite strong static benchmark scores.

Details:

What the papers do (methodology)
- Benchmark construction emphasizes temporally extended scenarios with intermediate decision points, tool use, and ambiguous constraints; evaluation measures include end-to-end task outcomes, policy violations across steps, and misuse completion rates rather than only refusal. (http://arxiv.org/abs/2605.07630v1, http://arxiv.org/abs/2605.03952v1, http://arxiv.org/abs/2605.07699v1, http://arxiv.org/abs/2605.06601v1)

Key technical contributions
- Phone-agent “risky moments”: Scenarios isolate critical decision points where an agent must choose safe actions under uncertainty, capturing failures like overconfident guidance, escalation mistakes, or unsafe procedural steps. (http://arxiv.org/abs/2605.07630v1)
- Sequenced coding misuse evaluation: Tests multi-step misuse where each step may appear benign but composes into harmful capability (e.g., exploit chains), measuring end-to-end completion and policy adherence across the sequence. (http://arxiv.org/abs/2605.03952v1)
- Retail/policy ambiguity benchmark: Focuses on ambiguous, real-world policy constraints (returns, age restrictions, exceptions), stressing agents’ ability to follow policy when instructions are underspecified or conflicting. (http://arxiv.org/abs/2605.07699v1)
- Binary patch auditing pipeline: Evaluates agent assistance for understanding and auditing binary patches, relevant to vulnerability triage and defensive security workflows (with dual-use implications). (http://arxiv.org/abs/2605.06601v1)

Applications to agent systems
- Orchestrators can use these benchmarks as regression suites for tool-use policies (when to escalate to a human, when to refuse, when to ask clarifying questions) rather than relying on generic jailbreak/refusal tests. (http://arxiv.org/abs/2605.07630v1, http://arxiv.org/abs/2605.07699v1)
- Security posture: Sequenced misuse evals are particularly relevant for coding agents with shell/network tools; they motivate stronger sandboxing, least-privilege tool scopes, and step-level policy enforcement. (http://arxiv.org/abs/2605.03952v1)

Practical integration notes
- Add “temporal policy compliance” metrics: track violations per step and per tool call, not just final outcome.
- Treat ambiguity as first-class: measure whether agents ask clarifying questions vs making unsafe assumptions in ambiguous policy contexts. (http://arxiv.org/abs/2605.07699v1)
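
The “temporal policy compliance” metrics suggested above could be computed from a logged trajectory roughly as follows. This is a sketch under assumptions: the `Step` record, its fields, and the idea that some step-level checker has already set the `violation` flag are hypothetical and do not come from the cited benchmarks.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Step:
    tool: Optional[str]  # tool name, or None for a text-only step
    violation: bool      # set upstream by a hypothetical step-level policy checker

def compliance_metrics(trajectory: List[Step]) -> Dict[str, object]:
    """Per-step and per-tool-call violation rates, plus the first violating step."""
    tool_steps = [s for s in trajectory if s.tool is not None]
    violating = [i for i, s in enumerate(trajectory) if s.violation]
    return {
        "violation_rate_per_step":
            len(violating) / len(trajectory) if trajectory else 0.0,
        "violation_rate_per_tool_call":
            sum(s.violation for s in tool_steps) / len(tool_steps) if tool_steps else 0.0,
        "first_violation_step": violating[0] if violating else None,
    }

# Hypothetical four-step rollout with one violating tool call.
trajectory = [
    Step(tool=None, violation=False),
    Step(tool="shell", violation=False),
    Step(tool="http", violation=True),
    Step(tool=None, violation=False),
]
metrics = compliance_metrics(trajectory)
```

Recording the first violating step alongside the rates makes it possible to tell an agent that fails early from one that completes most of a task before crossing a policy line, which a final-outcome metric cannot distinguish.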

Importance: Enterprise adoption of agents is increasingly gated by multi-step safety/security behavior under tool use, not by single-turn refusal rates. These benchmarks/pipelines are strategically important because they (a) better predict real deployment risk, (b) can be integrated into CI-style evaluation for agent releases, and (c) pressure product teams to implement step-level controls (sandboxing, escalation, ambiguity handling) that become durable differentiators. (http://arxiv.org/abs/2605.03952v1, http://arxiv.org/abs/2605.07630v1)

5. LLM judges and evaluation reliability: policy invariance, constraint-level judging, benchmarkless safety scoring, and dynamic boundary evaluation

Summary: This cluster targets the reliability of LLM-as-judge, which is now core infrastructure for RLHF/RLVR data generation, automated benchmarking, and safety audits. The papers propose properties judges should satisfy (e.g., policy invariance), finer-grained evaluation at the level of constraints, frameworks for ‘benchmarkless’ safety scoring contracts, and adaptive evaluation that probes decision boundaries. Together, they aim to reduce silent evaluation drift and improve comparability across model versions.

Details:

What the papers do (methodology)
- The works define evaluation desiderata and then test judges under controlled perturbations (policy rephrasings, constraint decompositions, adaptive query generation) to measure stability, invariance, and boundary behavior. They often compare multiple judge models/prompts and quantify disagreement, sensitivity, and failure modes. (http://arxiv.org/abs/2605.06161v1, http://arxiv.org/abs/2605.03858v1, http://arxiv.org/abs/2605.06652v1, http://arxiv.org/abs/2605.06213v1)

Key technical contributions
- Policy invariance: Formalizes that judge outcomes should not change under semantically equivalent policy restatements or irrelevant formatting changes, highlighting a concrete class of brittleness that can invalidate safety claims if untested. (http://arxiv.org/abs/2605.06161v1)
- Constraint-level judging: Breaks evaluations into explicit constraints (e.g., “must refuse X,” “must provide Y,” “must not mention Z”), enabling more diagnostic feedback and reducing ambiguity in what is being scored. (http://arxiv.org/abs/2605.03858v1)
- Benchmarkless safety scoring: Proposes a contract-like framing where safety scoring is defined by scenario packs + rubrics + rerun budgets + stability requirements rather than a single fixed benchmark, aiming to make audits more reproducible and less gameable. (http://arxiv.org/abs/2605.06652v1)
- Dynamic boundary evaluation: Uses adaptive testing to focus queries near the model/judge decision boundary, improving sensitivity to regressions and reducing overfitting to static test sets. (http://arxiv.org/abs/2605.06213v1)

Applications to agent systems
- RLVR/RLHF pipelines: If judges are used as reward models or preference labelers, invariance and boundary tests should be part of judge qualification; otherwise reward signals can drift silently and destabilize training. (http://arxiv.org/abs/2605.06161v1)
- Agent policy enforcement: Constraint-level judging maps naturally to step-level agent governance (“tool call allowed?”, “PII exposure?”, “policy exception?”), enabling more granular automated audits of trajectories. (http://arxiv.org/abs/2605.03858v1)

Practical integration notes
- Add judge CI: run invariance suites whenever policy text, prompts, or judge models change; treat large sensitivity as a release blocker. (http://arxiv.org/abs/2605.06161v1)
- Store rubrics as versioned artifacts: align with benchmarkless scoring guidance so audits can be rerun and compared across time. (http://arxiv.org/abs/2605.06652v1)
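
A judge-CI invariance check of the kind described above can be reduced to a flip rate over policy restatements. The sketch below uses a deliberately brittle keyword judge as a stand-in for an LLM call; `flip_rate`, `toy_judge`, the policy variants, and the 0.1 gate are all hypothetical and not drawn from the cited papers.

```python
from typing import Callable, List

def flip_rate(judge: Callable[[str, str], bool],
              response: str,
              policy_variants: List[str]) -> float:
    """Fraction of restated policies whose verdict differs from the first variant."""
    verdicts = [judge(policy, response) for policy in policy_variants]
    reference = verdicts[0]
    flips = sum(v != reference for v in verdicts[1:])
    return flips / max(1, len(verdicts) - 1)

def toy_judge(policy: str, response: str) -> bool:
    """Brittle stand-in judge: flags a violation only if the policy says 'must not'."""
    return "must not" in policy.lower() and "password" in response.lower()

# Three semantically equivalent restatements of one rule (hypothetical).
variants = [
    "The assistant must not reveal passwords.",
    "Never disclose passwords.",
    "Assistants MUST NOT share passwords.",
]
rate = flip_rate(toy_judge, "Your password is hunter2.", variants)
blocks_release = rate > 0.1  # hypothetical sensitivity gate
```

Running this over a fixed suite of responses whenever the policy text, prompt, or judge model changes gives a single number that can gate a release.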

Importance: Judge reliability is a strategic bottleneck: it affects post-training (RLHF/RLVR), automated evaluation, and external safety claims. These papers provide actionable mechanisms—policy invariance tests, constraint-level scoring, benchmarkless audit contracts, and adaptive boundary probing—that can be integrated into an agent platform’s evaluation stack to reduce drift, improve auditability, and strengthen enterprise trust. (http://arxiv.org/abs/2605.06161v1, http://arxiv.org/abs/2605.06213v1)

Additional Noteworthy Developments

Configuration-agnostic inference profiling for LLM serving

Summary: A profiling approach aims to reduce the operational overhead of performance tuning across changing model configurations and hardware setups.

Details: Targets configuration-agnostic performance characterization to speed iterative optimization and reduce regressions when swapping model variants or deployment parameters. (http://arxiv.org/abs/2605.07985v1)

Sources: [1]

Long-context prefill acceleration for hybrid-attention models

Summary: A systems paper accelerates the prefill stage for long-context inference in hybrid-attention architectures.

Details: Focuses on reducing prefill latency for long contexts in architectures that deviate from standard attention assumptions, improving tail latency under continuous batching. (http://arxiv.org/abs/2605.06221v1)

Sources: [1]

Token-budget coupling tax: when longer ‘thinking’ hurts

Summary: A study shows that under realistic output/token caps, allocating more tokens to reasoning can reduce answer quality by crowding out the final response.

Details: Introduces/quantifies a coupling between reasoning verbosity and answer truncation/quality under fixed budgets, motivating split-budget protocols. (http://arxiv.org/abs/2605.07686v1)
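
The coupling can be illustrated with simple budget arithmetic. This is a toy sketch, assuming a single shared output cap versus a protocol that reserves part of the cap for the answer; the function names and token counts are hypothetical and are not the paper's protocol.

```python
def answer_room_shared(cap: int, reasoning_tokens: int) -> int:
    """Tokens left for the final answer when reasoning and answer share one cap."""
    return max(0, cap - reasoning_tokens)

def answer_room_split(cap: int, reasoning_tokens: int, answer_reserve: int) -> int:
    """Tokens left for the answer when part of the cap is reserved for it.

    Reasoning is cut off at (cap - answer_reserve), so the answer always
    keeps at least `answer_reserve` tokens.
    """
    answer_reserve = max(0, min(answer_reserve, cap))
    reasoning_used = min(reasoning_tokens, cap - answer_reserve)
    return cap - reasoning_used

# Hypothetical case: a 1024-token cap and a trace that wants 1000 tokens.
shared = answer_room_shared(1024, 1000)
split = answer_room_split(1024, 1000, answer_reserve=256)
```

With a shared cap the long trace leaves only a sliver for the answer; with a reserve, the trace is truncated instead and the answer keeps its allotment.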

Sources: [1]

Answer-first then justify to mitigate trace/answer interference

Summary: A prompting/inference pattern improves reliability under tight budgets by producing the final answer before an optional justification.

Details: Demonstrates that ordering can reduce failures where long rationales crowd out or distort the final answer under constraints. (http://arxiv.org/abs/2605.06165v1)

Sources: [1]

Linear tool steering via activation interventions

Summary: Tool choice appears steerable with lightweight, approximately linear interventions in model activations.

Details: Suggests inference-time control over tool routing without retraining, potentially enabling rapid mitigation of misrouting failures. (http://arxiv.org/abs/2605.07990v1)

Sources: [1]

Memory inception via latent KV insertion

Summary: A steering method injects behavior/memory signals through latent KV mechanisms rather than visible prompt text.

Details: Enables mid-conversation behavior updates without transcript bloat, with implications for governance and auditability. (http://arxiv.org/abs/2605.06225v1)

Sources: [1]

Prompt-steering distillation

Summary: Distillation transfers prompt-based steering behaviors into the model to reduce reliance on long control prompts.

Details: Aims to preserve steering effects while lowering prompt overhead and improving consistency across contexts. (http://arxiv.org/abs/2605.03907v1)

Sources: [1]

Manifold/geometry-aware steering for more natural control

Summary: Steering methods that respect representation geometry aim to reduce artifacts from naive linear interventions.

Details: Proposes steering along manifold-aware directions to maintain output naturalness and reduce detectable distortions. (http://arxiv.org/abs/2605.05115v1)

Sources: [1]

Adaptive speculative decoding control

Summary: A decoding controller adaptively chooses speculation length to maximize speedups across conditions.

Details: Optimizes speculative decoding parameters online/step-wise to improve latency gains under varying acceptance rates. (http://arxiv.org/abs/2605.02888v1)
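
Why the best speculation length depends on the acceptance rate can be seen from the standard idealized model, in which each drafted token is accepted independently with probability `alpha`. The sketch below uses that textbook model, not the paper's controller; `draft_cost` (draft-model cost per token relative to one target-model pass) and all example values are assumptions.

```python
def expected_speedup(alpha: float, gamma: int, draft_cost: float) -> float:
    """Expected speedup over plain decoding for draft length `gamma`.

    Expected tokens per target pass are (1 - alpha**(gamma + 1)) / (1 - alpha),
    and each iteration costs gamma draft passes plus one target pass.
    """
    if alpha >= 1.0:
        tokens = gamma + 1.0
    else:
        tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    return tokens / (gamma * draft_cost + 1.0)

def best_gamma(alpha: float, draft_cost: float, max_gamma: int = 16) -> int:
    """Draft length in [0, max_gamma] maximizing the modeled speedup (0 = no speculation)."""
    return max(range(max_gamma + 1),
               key=lambda g: expected_speedup(alpha, g, draft_cost))

# Hypothetical acceptance rates, e.g. estimated from a running average.
gamma_low = best_gamma(alpha=0.5, draft_cost=0.05)
gamma_high = best_gamma(alpha=0.9, draft_cost=0.05)
```

An adaptive controller in this spirit would re-estimate `alpha` online and re-pick the draft length as conditions change; how the cited work does this is not specified here.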

Sources: [1]

Correctness caveats for grammar-masked speculative decoding

Summary: A theoretical/algorithmic result shows that combining grammar masking with speculative decoding can change the sampled distribution.

Details: Highlights that common “grammar constraints + speculation” implementations may be biased, affecting guarantees for structured outputs (code/JSON/DSL). (http://arxiv.org/abs/2605.07698v1)

Sources: [1]

Text-to-SQL agent with adaptive exploration

Summary: A Text-to-SQL system uses adaptive exploration policies to balance cost on easy queries vs performance on hard ones.

Details: Emphasizes iterative database interaction (inspection/verification/repair loops) rather than single-shot SQL generation. (http://arxiv.org/abs/2605.08057v1)

Sources: [1]

Flexible DB interaction + repair/backtracking for Text-to-SQL

Summary: A Text-to-SQL agent design improves robustness via flexible interaction and explicit repair/backtracking.

Details: Implements multi-step correction loops that can recover from schema/value misunderstandings through targeted follow-up queries. (http://arxiv.org/abs/2605.02815v1)

Sources: [1]

First-token entropy as a single-pass uncertainty signal

Summary: A paper proposes using early-token entropy as a low-cost proxy for uncertainty/hallucination risk.

Details: Targets cheap telemetry that can trigger selective escalation (retrieval, extra sampling, or human review) without multi-sample overhead. (http://arxiv.org/abs/2605.05166v1)
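
A single-pass entropy signal of this kind can be computed directly from the logits of the first generated token. The sketch below is a generic Shannon-entropy calculation, not the paper's exact estimator; the threshold and the example logits are hypothetical and would need calibration on real traffic.

```python
import math
from typing import List

def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs: List[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_escalate(first_token_logits: List[float], threshold: float) -> bool:
    """Flag the request for retrieval, extra sampling, or review when entropy is high."""
    return entropy(softmax(first_token_logits)) > threshold

# Hypothetical logits over a four-token vocabulary.
confident = [10.0, 0.0, 0.0, 0.0]
uncertain = [0.0, 0.0, 0.0, 0.0]
flag_confident = should_escalate(confident, threshold=0.5)
flag_uncertain = should_escalate(uncertain, threshold=0.5)
```

The uniform case reaches the maximum entropy of log(4) nats, while the peaked case is close to zero, so a threshold between the two separates them.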

Sources: [1]

Attention divergence features for hallucination detection

Summary: Attention-derived signals are evaluated as predictors of hallucination/instability.

Details: Uses internal attention behavior to build detectors, with the key open question being robustness across fine-tuning and architecture changes. (http://arxiv.org/abs/2605.05025v1)

Sources: [1]

Koopman/dynamical-systems framing for model uncertainty

Summary: A dynamical-systems perspective is used to analyze or estimate uncertainty/hallucination behavior.

Details: Reframes generation dynamics to derive alternative uncertainty indicators beyond standard logit-based heuristics. (http://arxiv.org/abs/2605.05134v1)

Sources: [1]

Neural-symbolic bridging for uncertainty/truthfulness signals

Summary: A neural-symbolic approach combines implicit model signals with more explicit structured judgments.

Details: Suggests hybridizing internal uncertainty cues with explicit symbolic checks/self-judgments to improve reliability. (http://arxiv.org/abs/2605.03971v1)

Sources: [1]

UniSD: unified self-distillation recipes

Summary: A unified framework standardizes self-distillation variants to reduce ad hoc recipe churn.

Details: Provides a consolidated experimental/ablation setup for comparing self-distillation choices and their effects. (http://arxiv.org/abs/2605.06597v1)

Sources: [1]

Preference-based self-distillation for reasoning

Summary: A self-distillation method uses preference-style objectives to improve outputs without explicit external rewards.

Details: Explores preference formulations as a substitute for RL in regimes where rewards are hard to specify. (http://arxiv.org/abs/2605.05040v1)

Sources: [1]

Limitations of on-policy self-distillation for long ‘thinking’ traces

Summary: Negative/limiting results indicate on-policy self-distillation may compress verbosity more than it corrects reasoning errors.

Details: Warns that optimizing on-policy traces can reduce length without improving correctness, separating “trace quality” from answer quality. (http://arxiv.org/abs/2605.06188v1)

Sources: [1]

Conformal guarantees for KGQA abstention/coverage

Summary: A KGQA method applies conformal calibration to provide coverage-style guarantees when abstaining.

Details: Uses conformal prediction to calibrate when the system should answer vs abstain, aiming for statistical reliability guarantees under stated assumptions. (http://arxiv.org/abs/2605.08077v1)
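
The answer-versus-abstain calibration can be sketched with standard split conformal prediction. This is a generic illustration, assuming a nonconformity score of one minus the model's probability for the correct answer and an abstain-unless-singleton rule; the paper's scoring function and decision rule may differ, and all names and numbers are hypothetical.

```python
import math
from typing import Dict, List

def conformal_threshold(cal_scores: List[float], alpha: float) -> float:
    """Split-conformal quantile: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[rank - 1]

def prediction_set(candidate_probs: Dict[str, float], qhat: float) -> List[str]:
    """Candidates whose nonconformity score (1 - p) is within the threshold."""
    return [a for a, p in candidate_probs.items() if 1.0 - p <= qhat]

def answer_or_abstain(candidate_probs: Dict[str, float], qhat: float):
    """Answer only when the prediction set holds exactly one candidate."""
    kept = prediction_set(candidate_probs, qhat)
    return kept[0] if len(kept) == 1 else None

# Hypothetical calibration scores: 1 - p(correct answer) on held-out questions.
qhat = conformal_threshold([0.1, 0.4, 0.2, 0.3], alpha=0.5)
clear = answer_or_abstain({"Paris": 0.9, "Lyon": 0.05}, qhat)
ambiguous = answer_or_abstain({"Paris": 0.8, "Lyon": 0.75}, qhat)
```

The coverage guarantee of split conformal prediction relies on calibration and test questions being exchangeable, which is the kind of stated assumption the summary refers to.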

Sources: [1]

Structural/graphlet tokens for KG transfer

Summary: Graph-structure tokenization is proposed to improve transfer across heterogeneous knowledge graphs.

Details: Encodes local graph structure as tokens to reduce dependence on a single global entity vocabulary. (http://arxiv.org/abs/2605.06154v1)

Sources: [1]

Schema-aware process reward modeling for KG reasoning

Summary: A process-reward approach incorporates schema structure to improve multi-step KG traversal reliability.

Details: Uses schema-aware cumulative/process rewards to reduce miscrediting and risk-compensation behaviors in multi-path reasoning. (http://arxiv.org/abs/2605.02819v1)

Sources: [1]