Briefing

The architectural wedge: routing inference, retrieval, and orchestration across heterogeneous compute

March 2026

An architectural pattern for enterprises whose workload composition is shifting faster than the procurement cycle. The wedge separates the routing decision from the procurement decision, and lets the operating model absorb a workload shift without renegotiating a vendor contract.

The architectural pattern we describe in this briefing addresses a specific operational problem we have observed in every enterprise AI engagement we have run since the first quarter of 2025. The procurement cycle, written for a stable workload composition, no longer matches the rate at which the workload composition itself is changing. The wedge is a structural insertion between the workflow and the model providers that decouples the two, and lets the operating model absorb a workload shift without renegotiating a vendor contract.

01 Why the procurement cycle no longer fits

The standard enterprise procurement cycle for a strategic vendor relationship runs between nine and eighteen months. The cycle assumes a workload whose essential composition is stable across the contract period: the system the vendor runs at the close of the contract is expected to resemble the system it was procured to run at the start. For the workload classes that made up the enterprise software estate before 2023, the assumption was generally sound. For the workload composition of an enterprise AI portfolio in 2026, the assumption is wrong by a factor of between four and seven.

The primary consequences are operational rather than financial. The financial consequences are real, and we have written about them elsewhere, but they are downstream. The operational consequence is that the architecture is forced to absorb a workload it was not procured for, the operating model is forced to absorb an architecture it was not designed for, and the contract that specifies the relationship between the two does not contemplate the change.

02 The shape of the wedge

The wedge sits between the workflow and the model providers. It is a thin architectural layer with exactly three responsibilities, and the discipline of refusing a fourth is what makes the wedge work. The three responsibilities are routing, instrumentation, and policy enforcement. The wedge does not store state, does not maintain conversation history, does not hold a reasoning trace, and does not retain anything that requires governance treatment beyond the routing decision itself.

The wedge is a place where decisions about which provider to call, which model to use, what the per-call cost envelope is, and which policies apply to the call are made and recorded. Everything else stays where it was. The workflow remains in the orchestrator. The state remains in the application. The reasoning remains in the model. The audit trail remains in the governance system. The wedge adds one layer, and removes nothing.

Responsibility 01

Routing. The decision about which model on which provider serves the call, against a written policy.

Responsibility 02

Instrumentation. The continuous measurement of cost, latency, and quality at the per-call level.

Responsibility 03

Policy enforcement. The runtime check that a call complies with the governance posture in effect at the moment of the call.

Excluded

State, conversation, reasoning trace, audit trail, business logic. The wedge holds none of these.
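The three responsibilities and the exclusions above can be sketched as a single stateless function. This is a minimal illustration under stated assumptions, not a reference implementation: the names `Call`, `DecisionRecord`, `handle`, the `permitted` mapping of data classes to targets, and the injected `route` and `emit` callables are all hypothetical and do not come from the briefing. Cost, latency, and quality measurement are omitted to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Mapping, Optional


@dataclass(frozen=True)
class Call:
    workflow: str    # workflow that issued the call
    data_class: str  # hypothetical governance label, e.g. "public" or "pii"


@dataclass(frozen=True)
class DecisionRecord:
    workflow: str
    allowed: bool
    target: Optional[str]  # "provider/model" chosen, or None if refused
    reason: str


def handle(
    call: Call,
    permitted: Mapping[str, FrozenSet[str]],         # governance posture in effect now
    route: Callable[[Call, FrozenSet[str]], str],    # routing policy (assumed supplied)
    emit: Callable[[DecisionRecord], None],          # instrumentation sink (assumed supplied)
) -> DecisionRecord:
    """Enforce policy, route, and instrument one call; retain nothing."""
    # Policy enforcement: which targets may serve this data class right now?
    targets = permitted.get(call.data_class, frozenset())
    if targets:
        # Routing: pick one of the permitted targets.
        record = DecisionRecord(call.workflow, True, route(call, targets), "ok")
    else:
        record = DecisionRecord(call.workflow, False, None, "no permitted target")
    # Instrumentation: the record leaves the wedge; no copy is kept here.
    emit(record)
    return record
```

Because `handle` takes the posture, the routing policy, and the sink as arguments and keeps no module-level state, the audit trail lives wherever `emit` sends the record, which in the briefing's terms is the governance system rather than the wedge.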

03 Routing inference across heterogeneous compute

The routing decision is the first responsibility, and it is the responsibility most often misunderstood when the wedge is described. Routing in this architecture is not load balancing. It is not failover. It is the explicit, policy-driven, instrumented decision about which provider on which substrate runs the call, with the policy written at the workflow level rather than the call level.

The policy is expressed against three dimensions. The first is the quality envelope: the call must be served by a model that meets a threshold on the metric that matters to the workflow, and the threshold is specified by the workflow rather than by the wedge. The second is the cost envelope: the call may not exceed a per-call cost ceiling, and the ceiling is set by the operating model rather than by the procurement contract. The third is the latency envelope: the call must return within a deadline that is set by the workflow, and the wedge is responsible for routing to a provider that can meet the deadline at the cost.
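A routing policy over these three envelopes can be sketched as a filter followed by a selection. The sketch below is illustrative only: `Target`, `RoutingPolicy`, and `route` are hypothetical names, the quality score and p95 latency are assumed to come from the instrumentation layer, and the tie-break (cheapest eligible target, then fastest) is our assumption rather than something the briefing prescribes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Target:
    provider: str
    model: str
    quality: float         # measured score on the workflow's own metric
    cost_per_call: float   # expected cost of one call, in currency units
    p95_latency_ms: float  # observed 95th-percentile latency


@dataclass(frozen=True)
class RoutingPolicy:
    min_quality: float        # quality envelope, set by the workflow
    max_cost_per_call: float  # cost envelope, set by the operating model
    deadline_ms: float        # latency envelope, set by the workflow


def route(policy: RoutingPolicy, targets: Sequence[Target]) -> Optional[Target]:
    """Return the cheapest target inside all three envelopes, or None."""
    eligible = [
        t for t in targets
        if t.quality >= policy.min_quality
        and t.cost_per_call <= policy.max_cost_per_call
        and t.p95_latency_ms <= policy.deadline_ms
    ]
    if not eligible:
        return None  # no provider can meet the policy; the caller must decide
    # Assumed tie-break: lowest cost first, then lowest latency.
    return min(eligible, key=lambda t: (t.cost_per_call, t.p95_latency_ms))
```

Note that the policy object carries no provider or model names. Changing which provider serves a workflow is a matter of changing the candidate list or the thresholds, not the calling code.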

04 Retrieval as a first-class architectural concern

In the architectures we advise against, retrieval is treated as a feature of the inference call. It is collapsed into prompt construction, hidden inside a vendor abstraction, and not separately instrumented. In the architectures we recommend, retrieval is a first-class concern: separately instrumented, separately governed, and separately routable. The reason is that the cost surface of retrieval is structurally different from the cost surface of inference, and bundling the two prevents the operating model from instrumenting either correctly.

The retrieval layer in the wedge architecture sits adjacent to the routing layer rather than below it. A call to the wedge issues a routing decision and, separately, a retrieval decision. The retrieval decision selects an index, a corpus, and a recall envelope, and is instrumented against a cost-per-document-retrieved metric that is independent of the cost-per-token metric that governs the inference call.
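To make the separation concrete, the sketch below keeps the retrieval decision and the two cost meters as distinct values. It is an assumption-laden illustration: `RetrievalDecision`, `CallMeters`, and their fields are hypothetical names, and how the underlying costs are obtained from a provider or index is out of scope.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetrievalDecision:
    index: str         # which index serves the lookup
    corpus: str        # which corpus the index covers
    min_recall: float  # recall envelope the lookup must meet


@dataclass(frozen=True)
class CallMeters:
    retrieval_cost: float     # total spend on retrieval for this call
    documents_retrieved: int
    inference_cost: float     # total spend on inference for this call
    tokens: int

    def cost_per_document(self) -> Optional[float]:
        """Retrieval metric; independent of token counts."""
        if self.documents_retrieved == 0:
            return None  # nothing retrieved, so the metric is undefined
        return self.retrieval_cost / self.documents_retrieved

    def cost_per_token(self) -> Optional[float]:
        """Inference metric; independent of document counts."""
        if self.tokens == 0:
            return None
        return self.inference_cost / self.tokens
```

Keeping the two meters apart means a more expensive index shows up in cost per document without distorting cost per token, and the reverse holds for a model change.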

05 Orchestration above the wedge

Orchestration, the third concern, sits above the wedge rather than inside it. The orchestrator decides which calls to make, in which order, against which workflow state. The wedge decides where each call runs and how it is instrumented. Keeping the two concerns separate is the architectural discipline that lets the operating model rewrite the orchestrator without rewriting the routing layer, and lets the operating model rewrite the routing layer without rewriting the orchestrator.
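The separation can be illustrated with an orchestrator that only knows a dispatch function. This is a toy sketch under our own assumptions: `run_workflow`, `make_wedge`, the `Dispatch` type, and the string-in, string-out backends are hypothetical stand-ins for real workflow steps and model calls.

```python
from typing import Callable, Dict, Sequence

# The only contract between the two layers: (step name, payload) -> result.
Dispatch = Callable[[str, str], str]


def run_workflow(steps: Sequence[str], payload: str, dispatch: Dispatch) -> str:
    """Orchestrator: decides which calls to make and in what order.

    It holds the workflow state and knows nothing about providers.
    """
    state = payload
    for step in steps:
        state = dispatch(step, state)
    return state


def make_wedge(routes: Dict[str, Callable[[str], str]]) -> Dispatch:
    """Routing layer: decides where each step runs.

    `routes` maps a step name to a backend callable (a stand-in for a
    provider call). It knows nothing about step ordering.
    """
    def dispatch(step: str, payload: str) -> str:
        return routes[step](payload)
    return dispatch
```

Swapping the routing table passed to `make_wedge` changes where every step runs without touching `run_workflow`, and reordering `steps` changes the workflow without touching the routing table.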

The architectures that have absorbed the last two workload shifts cleanly are the architectures that kept the orchestrator and the routing layer in separate concerns. The architectures that have not are, in every case we have examined, architectures where the two were collapsed into a single component for engineering convenience at the time of the original build.

Engagement note, technology portfolio, second quarter 2025

06 Operating consequences

The operating consequence of the wedge is that the procurement cycle is decoupled from the workload composition. In a coupled architecture, a workload shift requires a vendor renegotiation. In the wedge architecture, the same shift requires only a policy rewrite at the routing layer. The vendor relationships continue to run on the cycle they were procured against, the workload runs against the policy written at the operating-model level, and the two cycles are no longer required to match.

The wedge is not free. The instrumentation overhead is meaningful, the policy authoring discipline is non-trivial, and the engineering cost of maintaining the routing layer in production is real. The argument we are making is not that the wedge is cost-neutral. The argument is that the cost of the wedge is bounded, predictable, and absorbed at the operating-model level, while the cost of not having the wedge is the cost of a forced renegotiation against a workload that has already shifted.

About the author

Raghav Ram, PhD is Managing Partner of Intelligine Group, where he leads the firm’s AI diagnostic and architecture engagements across financial services, healthcare, and industrials. He writes on the operating model, unit economics, and governance of enterprise AI.