USUL

Created: April 13, 2026 at 8:04 AM

ACADEMIC RESEARCH - 2026-04-13

Executive Summary

  • BadSkill (agent skill supply-chain backdoors): Shows that installable agent “skills” can bundle opaque learned models with stealthy backdoors, shifting marketplace security from prompt/code review to full ML artifact provenance and runtime monitoring.
  • Harmfulness is localized/compressible in weights: Finds harmful behavior can be traced to a compact subset of parameters via pruning-based causal probing, suggesting more surgical mitigation/unlearning may preserve capabilities while improving safety robustness.
  • Process Reward Agents (PRA) for retrieval-grounded search: Introduces online, step-wise, retrieval-grounded reward signals to guide inference-time search/decoding with a frozen policy, improving reliability without full RL retraining.

Top Priority Items

1. BadSkill: supply-chain backdoor attack via bundled models inside installable agent skills

Summary: BadSkill demonstrates a realistic supply-chain attack on agent ecosystems: third-party installable “skills” can ship embedded learned models (weights) that contain backdoors, enabling malicious behavior under subtle semantic triggers while appearing benign under typical review. The work reframes agent security as an ML artifact supply-chain problem (provenance, signing, reproducibility, and runtime monitoring), not just prompt injection or tool permissioning.
Details:

Methodology and threat model
  • The paper constructs an agent-skill marketplace scenario in which skills are distributed as installable packages that may include both code and bundled model weights, and evaluates attacker success when defenders rely on conventional controls (e.g., code review, basic prompt-trigger tests). The attack hinges on the fact that a skill can include an auxiliary model that influences downstream agent decisions while remaining opaque to static inspection. (http://arxiv.org/abs/2604.09378v1)

Key technical contributions
  • Bundled-model backdoor vector: The paper’s core contribution is articulating and empirically validating a backdoor channel that lives inside a shipped model artifact within a skill package, rather than in prompts or tool schemas. This matters because many agent platforms treat “skills/plugins” as code plus metadata, while weights are effectively un-audited binaries. (http://arxiv.org/abs/2604.09378v1)
  • Semantic/parameter-combination triggering: The work emphasizes triggers that are not obvious strings but routine-looking parameter combinations or semantic conditions, which reduces the effectiveness of naive red-team checks that scan for known trigger phrases. (http://arxiv.org/abs/2604.09378v1)

Results (what is shown)
  • BadSkill shows that a skill can behave normally in common test cases yet reliably misbehave when a hidden trigger condition is met, illustrating a high-leverage compromise point for agent ecosystems where skills are widely shared and reused. (http://arxiv.org/abs/2604.09378v1)

Applications to agent systems (defensive design patterns)
  • Marketplace controls for ML artifacts: Treat skill weights as first-class supply-chain objects: require provenance metadata, cryptographic signing, and reproducible-build-style pipelines (e.g., “build from source” training recipes where feasible) rather than accepting arbitrary binaries. (http://arxiv.org/abs/2604.09378v1)
  • Backdoor-oriented skill evaluation: Add semantic-trigger testing and fuzzing over tool arguments/parameter spaces (not just prompt strings), plus canary tasks that probe for policy-violating behavior under innocuous-looking inputs. (http://arxiv.org/abs/2604.09378v1)
  • Runtime containment: Enforce least privilege and sandboxing for skills, and implement continuous telemetry/anomaly detection on skill actions (e.g., unusual tool-call patterns, exfiltration-like behavior), since pre-install vetting may miss learned triggers. (http://arxiv.org/abs/2604.09378v1)

Strategic importance for an agentic infrastructure startup
  • Product roadmap impact: If you plan a skill/plugin ecosystem, this paper argues for building a “model artifact security layer” early (signing, attestations, policy-enforced execution, and post-install monitoring) rather than bolting it on later.
  • Competitive relevance: Platforms that can credibly certify third-party skills (including bundled weights) will have an enterprise advantage; BadSkill provides a concrete, research-backed justification for stricter marketplace governance and for charging for “verified skills.” (http://arxiv.org/abs/2604.09378v1)
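The paper does not prescribe an implementation, but the “marketplace controls” pattern above can be made concrete: gate skill activation on a hash-plus-signature check over the bundled weights. The sketch below is illustrative only; an HMAC stands in for a real asymmetric signing scheme, and all function and field names are hypothetical:

```python
import hashlib
import hmac
import json

def sha256_digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_skill_artifact(weights: bytes, provenance: dict, registry_key: bytes) -> bool:
    """Reject a skill's bundled weights unless their hash matches a
    signed provenance record from a trusted registry."""
    expected = provenance.get("weights_sha256", "")
    if not hmac.compare_digest(sha256_digest(weights), expected):
        return False  # weights were swapped or tampered with
    # Verify the provenance record itself was signed by the marketplace.
    payload = json.dumps({"weights_sha256": expected}, sort_keys=True).encode()
    sig = hmac.new(registry_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, provenance.get("signature", ""))

# Example: a registry-signed record for known-good weights.
key = b"marketplace-signing-key"
good = b"\x00\x01fake-weight-bytes"
record = {"weights_sha256": sha256_digest(good)}
payload = json.dumps(record, sort_keys=True).encode()
record["signature"] = hmac.new(key, payload, hashlib.sha256).hexdigest()

print(verify_skill_artifact(good, record, key))                 # True
print(verify_skill_artifact(b"tampered-weights", record, key))  # False
```

In a real deployment this check would live in the skill installer, with keys managed by the marketplace and signatures produced offline; the point is that weights, not just code, get a verifiable chain of custody.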

2. Harmfulness is localized/compressible in LLM weights (pruning-based causal probe)

Summary: This paper argues that harmful behaviors in LLMs are localized and compressible: a relatively small subset of parameters can disproportionately drive harmful outputs, as identified via pruning-based causal probing. The result suggests targeted interventions (pruning, surgical unlearning, constrained fine-tuning) may mitigate harm while preserving broad capabilities, and it offers a mechanistic lens on why later fine-tunes can reintroduce misalignment.
Details:

Methodology
  • The authors use pruning-based causal probing to identify parameter subsets that are most causally responsible for harmful behavior, then test how removing/attenuating those parameters affects harmfulness versus general capability. The approach is positioned as a mechanistic probe rather than a purely behavioral evaluation: it attempts to localize where harmful behavior “lives” in the weights. (http://arxiv.org/abs/2604.09544v1)

Key results
  • Localization/compressibility finding: Harmfulness can be substantially reduced by intervening on a compact subset of weights, implying harmful behavior is not uniformly distributed across the network. (http://arxiv.org/abs/2604.09544v1)
  • Capability preservation: The paper reports that targeted interventions can reduce harmful outputs while retaining more benign capability than blunt, global modifications would. (http://arxiv.org/abs/2604.09544v1)

Technical contributions
  • A concrete causal-probing recipe for “harm circuits” using pruning as an intervention mechanism, turning the abstract question “where is harmfulness represented?” into a measurable, engineering-relevant workflow. (http://arxiv.org/abs/2604.09544v1)

Potential applications to agent systems
  • Safer model lifecycle for agent platforms: If you ship fine-tuned models for tools/agents, this suggests adding a post-fine-tune “harm-localization regression test” and optionally a surgical mitigation step before deployment. (http://arxiv.org/abs/2604.09544v1)
  • Defense-in-depth against safety regression: Because the paper motivates the idea that alignment constraints may be fragile under subsequent fine-tuning, agent stacks that do continual learning (or frequent adapter updates) should consider continuous safety constraints and automated checks as part of CI/CD for models. (http://arxiv.org/abs/2604.09544v1)
  • New evaluation axis: Track “harmfulness compression/localization” metrics as a robustness indicator; e.g., if harmfulness becomes concentrated in a small subspace after training, later updates might more easily re-activate it. (http://arxiv.org/abs/2604.09544v1)

Strategic importance
  • Roadmap: Enables a plausible path to capability-preserving safety improvements (surgical unlearning/pruning) that could be integrated into model release pipelines for agent products.
  • Competitive relevance: If validated across model families, teams that can operationalize targeted harm mitigation may deliver stronger safety guarantees with less capability loss than broad behavior shaping, which is valuable for enterprise adoption. (http://arxiv.org/abs/2604.09544v1)
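As a toy illustration of pruning-based causal probing (not the paper’s method or scale), the sketch below plants a compact “harm circuit” in a linear scorer, attributes harmfulness to individual weights, and shows that ablating only the top-attributed weights collapses the harmful score while leaving a disjoint benign capability largely intact:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "model": one weight vector; disjoint coordinate blocks carry a
# planted "harm circuit" and a benign capability.
d = 200
harm_idx, benign_idx = np.arange(5), np.arange(5, 15)
w = rng.normal(0, 0.05, d)
w[harm_idx] += 3.0    # planted compact harmful subset
w[benign_idx] += 2.0  # benign capability we want to preserve

harmful_x = rng.normal(0, 1, (50, d)); harmful_x[:, harm_idx] += 2.0
benign_x  = rng.normal(0, 1, (50, d)); benign_x[:, benign_idx] += 2.0

def score(weights, x):
    return float(np.mean(x @ weights))

# Pruning-based causal probe: attribute harmfulness to each weight via its
# mean contribution on harmful inputs, then ablate only the top-k weights.
attribution = np.abs(w * harmful_x.mean(axis=0))
top_k = np.argsort(attribution)[-5:]
pruned = w.copy()
pruned[top_k] = 0.0

print(score(w, harmful_x), score(pruned, harmful_x))  # harm score collapses
print(score(w, benign_x),  score(pruned, benign_x))   # benign score barely moves
```

The takeaway mirrors the paper’s claim at toy scale: when harmful behavior is concentrated in few parameters, a surgical intervention can beat global behavior shaping on the harm/capability trade-off.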

3. Process Reward Agents (PRA): online, retrieval-grounded step-wise rewards for search-based decoding

Summary: PRA proposes using an online agent to produce step-wise reward signals grounded in retrieval, guiding inference-time search/decoding for a frozen base model. This shifts some reliability gains from expensive training-time RL to test-time compute, with reward signals tied to external evidence—particularly relevant for domains like medicine where verifiability matters.
Details:

Methodology
  • The paper frames decoding as a search problem in which partial solutions (intermediate reasoning steps) can be evaluated online. PRA provides step-wise rewards by grounding intermediate steps against retrieved evidence, then uses those rewards to steer search-based decoding without updating the underlying policy model. (http://arxiv.org/abs/2604.09482v1)

Key technical contributions
  • Online, step-level reward generation: Unlike post-hoc process reward models that score completed traces, PRA produces reward signals during generation, enabling dynamic pruning/expansion of candidate continuations. (http://arxiv.org/abs/2604.09482v1)
  • Retrieval-grounded verification channel: Rewards are conditioned on retrieved information, tying the scoring function to external evidence rather than purely model-internal heuristics. (http://arxiv.org/abs/2604.09482v1)

Key results
  • The paper reports improved reliability from inference-time search guided by retrieval-grounded step rewards, demonstrating that meaningful gains can be achieved without retraining the base model. (http://arxiv.org/abs/2604.09482v1)

Applications to agent systems
  • Orchestration pattern: “Generate → retrieve → score step → branch/continue” can be implemented as a reusable agent runtime primitive, aligning with tool-using agents that already interleave planning and calls to search/KB tools. (http://arxiv.org/abs/2604.09482v1)
  • Regulated-domain agents: PRA-like reward channels can be tied to approved corpora (clinical guidelines, internal policies) to bias reasoning toward auditable evidence.
  • Safety considerations: Because retrieval becomes part of the reward channel, the paper implies a new attack surface: retrieval poisoning/manipulation can directly steer decoding. Retrieval integrity (signing, allowlists, provenance) therefore becomes a safety requirement, not just a quality feature. (http://arxiv.org/abs/2604.09482v1)

Strategic importance
  • Product leverage: If you operate an agent platform, PRA suggests you can ship reliability improvements via runtime search + verification components, potentially faster than retraining cycles.
  • Competitive relevance: Test-time compute and verifiable reasoning are converging industry trends; PRA provides an academically grounded design that can differentiate “enterprise-grade” agents via evidence-grounded intermediate checks. (http://arxiv.org/abs/2604.09482v1)
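The generate → retrieve → score step → branch/continue loop can be sketched as a toy beam search in which each partial trace is rewarded by its overlap with retrieved evidence. The vocabulary, evidence set, and reward function below are illustrative stand-ins, not PRA’s actual components:

```python
# Evidence returned by a (stand-in) retrieval step for the current question.
EVIDENCE = {"aspirin", "inhibits", "cox", "enzymes", "reducing", "inflammation"}

def step_reward(partial: list[str]) -> float:
    """Retrieval-grounded step reward: fraction of words supported by evidence."""
    return sum(w in EVIDENCE for w in partial) / len(partial)

def expand(partial: list[str]) -> list[list[str]]:
    """Stand-in for the frozen policy proposing next-word continuations."""
    vocab = ["aspirin", "inhibits", "cox", "magic", "crystals", "enzymes"]
    return [partial + [w] for w in vocab if w not in partial]

def guided_search(steps: int = 3, beam: int = 2) -> list[str]:
    frontier: list[list[str]] = [[]]
    for _ in range(steps):
        candidates = [c for p in frontier for c in expand(p)]
        # Online step-wise scoring prunes the frontier at every step,
        # instead of scoring only completed traces post hoc.
        candidates.sort(key=step_reward, reverse=True)
        frontier = candidates[:beam]
    return frontier[0]

print(guided_search())  # evidence-supported continuations win the beam
```

The policy model stays frozen throughout; all the steering happens in the search loop, which is why this pattern can ship as a runtime component rather than a retraining cycle.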

Additional Noteworthy Developments

Many-Tier Instruction Hierarchy (ManyIH) and ManyIH-Bench for resolving conflicts across many privilege levels

Summary: ManyIH formalizes instruction-following under many privilege levels and introduces a benchmark to evaluate whether models resolve deep policy conflicts correctly. (http://arxiv.org/abs/2604.09443v1)

Details: The paper extends beyond system/developer/user role hierarchies to multi-level constraint stacks and evaluates conflict resolution behavior, making it a practical target for enterprise agent governance testing. (http://arxiv.org/abs/2604.09443v1)

Sources: [1]
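A minimal sketch of many-tier conflict resolution, assuming strict tier precedence; the tiers, topics, and schema below are illustrative, not ManyIH’s actual formalization:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instruction:
    tier: int        # 0 = platform, 1 = org policy, 2 = developer, 3 = user, ...
    topic: str       # what the directive governs
    directive: str

def resolve(stack: list[Instruction]) -> dict[str, str]:
    """Return the winning directive per topic: lower tier number wins."""
    winners: dict[str, Instruction] = {}
    for ins in stack:
        cur = winners.get(ins.topic)
        if cur is None or ins.tier < cur.tier:
            winners[ins.topic] = ins
    return {topic: ins.directive for topic, ins in winners.items()}

stack = [
    Instruction(0, "data_export", "never export PII"),
    Instruction(2, "data_export", "export full user table"),  # conflicts with tier 0
    Instruction(3, "tone", "be casual"),
]
print(resolve(stack))  # data_export resolves to the tier-0 directive
```

ManyIH-Bench-style evaluation then asks whether a model’s behavior matches this kind of precedence resolution when the conflicting instructions are embedded in natural language rather than a tidy schema.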

RecaLLM: interleaving reasoning with explicit in-context retrieval to mitigate 'lost-in-thought'

Summary: RecaLLM targets long-context failures where retrieval degrades after extended reasoning and proposes an interleaved reason–retrieve approach. (http://arxiv.org/abs/2604.09494v1)

Details: By explicitly alternating reasoning with retrieval/recall, the method aims to preserve effective context use in long-horizon agent workflows without simply increasing context length. (http://arxiv.org/abs/2604.09494v1)

Sources: [1]

NetForge_RL: high-fidelity cyber defense simulator bridging Sim2Real for MARL

Summary: NetForge_RL provides a dual-mode cyber defense environment intended to train at scale while enabling more realistic evaluation. (http://arxiv.org/abs/2604.09523v1)

Details: The simulator is positioned to reduce the gap between toy cyber ranges and realistic execution environments, improving credibility of agent evaluation under partial observability and realistic telemetry. (http://arxiv.org/abs/2604.09523v1)

Sources: [1]

VL-Calibration: decoupled visual vs reasoning confidence calibration for LVLMs via RL

Summary: VL-Calibration proposes separately calibrating perception confidence and reasoning confidence in LVLMs using RL-style optimization without explicit perception labels. (http://arxiv.org/abs/2604.09529v1)

Details: The decoupling targets high-certainty hallucinations by distinguishing “I didn’t see it” from “I can’t infer it,” improving selective deferral and human-in-the-loop triggers in multimodal agents. (http://arxiv.org/abs/2604.09529v1)

Sources: [1]

HiL-Bench: benchmark for selective escalation (asking clarifying questions) in coding agents

Summary: HiL-Bench evaluates whether coding agents appropriately ask clarifying questions instead of guessing under underspecified tasks. (http://arxiv.org/abs/2604.09408v1)

Details: It operationalizes “escalation as success” with human-validated blockers and anti-gaming metrics, pushing agent designs toward explicit uncertainty detection and clarification policies. (http://arxiv.org/abs/2604.09408v1)

Sources: [1]
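As a toy illustration of “escalation as success” (the confidence heuristic and threshold below are crude stand-ins, not HiL-Bench’s human-validated metrics), an agent can route underspecified tasks to a clarifying question instead of acting:

```python
# Hypothetical vagueness markers used by the stand-in confidence heuristic.
AMBIGUOUS_MARKERS = {"somehow", "etc", "maybe", "whatever", "stuff"}

def spec_confidence(task: str) -> float:
    """Crude proxy for how well-specified a task is (1.0 = fully specified)."""
    words = task.lower().split()
    if not words:
        return 0.0
    hits = sum(w.strip(".,") in AMBIGUOUS_MARKERS for w in words)
    return max(0.0, 1.0 - 5 * hits / len(words))

def next_action(task: str, threshold: float = 0.7) -> str:
    """Act when the spec looks clear; escalate when it does not."""
    return "act" if spec_confidence(task) >= threshold else "ask_clarifying_question"

print(next_action("Rename util.py to utils.py and update imports"))  # act
print(next_action("Clean up the stuff in the repo somehow"))         # ask_clarifying_question
```

The benchmark’s anti-gaming metrics exist precisely because a policy like this can be trivially gamed (always asking); the design goal is selective escalation, not maximal escalation.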

Analog electro-optic Softmax/Sigmoid using thin-film lithium niobate modulators to cut transformer latency

Summary: Demonstrates an analog photonic approach to Softmax/Sigmoid intended to reduce transformer inference latency and study quantization/noise robustness. (http://arxiv.org/abs/2604.09512v1)

Details: If the reported robustness and latency benefits translate beyond lab settings, it supports hybrid digital–analog accelerator designs where Softmax is on the critical path. (http://arxiv.org/abs/2604.09512v1)

Sources: [1]

Jackal: large-scale execution-based NL-to-JQL benchmark + tool-augmented agent baseline

Summary: Jackal introduces an execution-based benchmark for translating natural language to Jira Query Language (JQL) with tool-augmented agent baselines. (http://arxiv.org/abs/2604.09470v1)

Details: By evaluating against live execution behavior (schema/value ambiguity, boolean logic), it better matches enterprise tool-use conditions and encourages retrieval+verification loops. (http://arxiv.org/abs/2604.09470v1)

Sources: [1]

E3-TIR: warm-up paradigm blending expert prefixes, guided exploration, and self-exploration for tool-integrated reasoning

Summary: E3-TIR proposes a warm-up training recipe combining expert prefixes with guided and self-exploration to improve early-stage tool-integrated reasoning training. (http://arxiv.org/abs/2604.09455v1)

Details: It targets exploration inefficiency/mode collapse in tool-RL setups and may reduce time-to-competence before heavier RL stages, depending on reproducibility across tasks/models. (http://arxiv.org/abs/2604.09455v1)

Sources: [1]

Safe continual RL policy updates via Rashomon set projection with formal guarantees

Summary: Proposes maintaining safety during continual RL updates by projecting learned policies onto a certified safe Rashomon set. (http://arxiv.org/abs/2604.09452v1)

Details: The method is positioned as an update-time safety filter with formal guarantees tied to the demonstration/data distribution, suggesting a template for safer iterative improvement loops. (http://arxiv.org/abs/2604.09452v1)

Sources: [1]

Algorithmic monoculture in LLM multi-agent coordination: baseline similarity and incentive-driven regulation

Summary: Finds LLM agents exhibit high baseline behavioral similarity and limited sustained diversity even when diversity is incentivized. (http://arxiv.org/abs/2604.09502v1)

Details: This suggests multi-agent ensembles/debate may deliver less robustness via diversity than assumed, motivating architectural/training interventions rather than relying on incentives alone. (http://arxiv.org/abs/2604.09502v1)

Sources: [1]

TRouter: cold-start robust LLM routing via task-taxonomy-guided data synthesis and latent task types

Summary: TRouter addresses cold-start LLM routing using task taxonomies, synthetic data, and latent task-type modeling. (http://arxiv.org/abs/2604.09377v1)

Details: It proposes reducing labeled routing data needs by structuring tasks and learning latent types, aiming to improve cost/performance routing when deploying new products or domains. (http://arxiv.org/abs/2604.09377v1)

Sources: [1]

VISOR: agentic Visual RAG with iterative search and over-horizon reasoning for multi-page evidence

Summary: VISOR proposes an agentic visual RAG system with iterative search to gather multi-page evidence for visual document reasoning. (http://arxiv.org/abs/2604.09508v1)

Details: It emphasizes structured evidence selection over brute-force context expansion, improving auditability and reliability for enterprise document QA workflows. (http://arxiv.org/abs/2604.09508v1)

Sources: [1]

EgoTL: think-aloud egocentric dataset/pipeline for long-horizon household tasks with spatial calibration

Summary: EgoTL introduces an egocentric dataset/pipeline with temporally aligned think-aloud reasoning and metric spatial calibration for long-horizon tasks. (http://arxiv.org/abs/2604.09535v1)

Details: The “say-before-act” supervision channel plus spatial grounding aims to reduce planning hallucinations and improve faithfulness in embodied instruction following. (http://arxiv.org/abs/2604.09535v1)

Sources: [1]

VLM depth study: visual token representations converge early and become interchangeable across layers

Summary: Analyzes VLMs and reports that visual token representations stabilize early and become increasingly interchangeable across depth. (http://arxiv.org/abs/2604.09425v1)

Details: The finding suggests architectural inefficiency and motivates compute-saving techniques like early exiting, token dropping, or shallower visual processing stacks. (http://arxiv.org/abs/2604.09425v1)

Sources: [1]

Encoding–Grounding Dissociation in VLMs: 'seeing' evidence encoded but not used due to arbitration

Summary: Shows a dissociation where VLMs may encode visual evidence but fail to use it due to late-stage arbitration between priors and evidence. (http://arxiv.org/abs/2604.09364v1)

Details: Reframes some multimodal errors as fusion/arbitration failures, implying interventions should target late-layer grounding and decision mechanisms rather than only improving visual encoders. (http://arxiv.org/abs/2604.09364v1)

Sources: [1]

Sparse point-trajectory diffusion for open-set future scene dynamics from a single image

Summary: Proposes sparse point-trajectory diffusion to forecast open-set future dynamics from a single image with multi-modality and uncertainty. (http://arxiv.org/abs/2604.09527v1)

Details: A representation shift from dense video prediction to sparse trajectories may reduce compute while supporting constraint-guided rollouts useful for planning stacks. (http://arxiv.org/abs/2604.09527v1)

Sources: [1]

Learning-to-Defer with advice: inconsistency of separated surrogates and a consistent augmented surrogate

Summary: Formalizes learning-to-defer where the system also chooses what advice/context to provide to the expert, and proposes a consistent surrogate objective. (http://arxiv.org/abs/2604.09414v1)

Details: Warns that common separated-head surrogate training can be inconsistent and provides an augmented surrogate aligned with the joint defer+advice objective, relevant to escalation/tool-routing design. (http://arxiv.org/abs/2604.09414v1)

Sources: [1]

Automated Instruction Revision (AIR) positioned among adaptation methods; performance is task-dependent

Summary: Compares AIR to other adaptation methods and finds performance is task-dependent, with AIR particularly effective for label remapping. (http://arxiv.org/abs/2604.09418v1)

Details: Provides selection guidance for adaptation portfolios (prompting vs retrieval vs fine-tuning vs rule-like revision), rather than a universal new best method. (http://arxiv.org/abs/2604.09418v1)

Sources: [1]

Survey of credit assignment methods for LLM RL across reasoning and agentic regimes

Summary: Surveys credit assignment approaches for RL on LLMs across reasoning-centric and agentic long-horizon settings. (http://arxiv.org/abs/2604.09459v1)

Details: Consolidates terminology and method families (sparse rewards, long horizons, tool use), helping practitioners choose approaches and identify open problems. (http://arxiv.org/abs/2604.09459v1)

Sources: [1]

AI Codebase Maturity Model (ACMM) experience report on AI-driven development feedback loops

Summary: Experience report proposing an AI Codebase Maturity Model emphasizing tests/CI/metrics as multipliers for AI coding tools. (http://arxiv.org/abs/2604.09388v1)

Details: Argues AI coding productivity depends on surrounding feedback loops and provides a framework to assess readiness and reduce regression risk. (http://arxiv.org/abs/2604.09388v1)

Sources: [1]

Capacity-derived semantics and communication phase transition via quotient POMDP abstractions

Summary: Develops a theoretical account of semantics/communication under bounded capacity using quotient POMDP abstractions and phase-transition behavior. (http://arxiv.org/abs/2604.09521v1)

Details: Suggests hard limits and threshold effects in intent-preserving communication below certain rates/capacities, potentially informing abstraction and protocol design if operationalized. (http://arxiv.org/abs/2604.09521v1)

Sources: [1]

Epidemiological world models framework for controlled partially observed epidemic dynamics

Summary: Conceptual framework for world models in epidemiology emphasizing controlled, partially observed dynamics. (http://arxiv.org/abs/2604.09519v1)

Details: Positions surveillance as endogenous and policy-dependent, motivating sequential decision-making formulations; primarily niche to public health policy agents. (http://arxiv.org/abs/2604.09519v1)

Sources: [1]

Quantum-inspired document embeddings framework with hybrid retrieval diagnostics

Summary: Proposes a quantum-inspired embedding construction and diagnostics for hybrid retrieval setups. (http://arxiv.org/abs/2604.09430v1)

Details: Contributes tooling/diagnostics for BM25+embedding hybrid tuning and reproducible embedding experiments; differentiation vs strong baselines remains to be established. (http://arxiv.org/abs/2604.09430v1)

Sources: [1]
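The BM25+embedding hybrid setting that the diagnostics target can be illustrated with a generic fusion rule; reciprocal rank fusion (RRF) below is a standard baseline for combining two rankers, not the paper’s quantum-inspired method:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: rank docs by the sum of 1 / (k + rank)
    over each input ranking that contains them."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top  = ["d3", "d1", "d7"]     # lexical ranker's top results
embed_top = ["d1", "d9", "d3"]     # dense ranker's top results
print(rrf([bm25_top, embed_top]))  # docs ranked by both rankers surface first
```

Diagnostics of the kind the paper contributes would then ask where each ranker helps or hurts, rather than tuning a single fused score blindly.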

EpiAgent: hierarchical LLM-agent system for restoring degraded ancient inscriptions

Summary: Presents a hierarchical LLM-agent workflow for restoring degraded ancient inscriptions as a domain case study. (http://arxiv.org/abs/2604.09367v1)

Details: Demonstrates plan/execute/reevaluate orchestration with tools in a niche domain and highlights evaluation challenges for iterative expert-in-the-loop restoration. (http://arxiv.org/abs/2604.09367v1)

Sources: [1]