Calling Azure OpenAI from a .NET API: patterns that hold up in production

Client lifecycle, HttpClient + Polly, streaming, configuration, and observability for Azure OpenAI behind ASP.NET Core—with examples and official references.

Integrating Azure OpenAI Service into a .NET API is easy in a demo; production is about timeouts, retries, streaming, quotas, and operability. Below is a condensed pattern set I use when the goal is predictable behavior under load—not a proof of concept.

1. One client, registered in DI

Constructing OpenAIClient (or the Azure-specific client from Azure.AI.OpenAI) per request wastes sockets and makes timeouts inconsistent. Register one singleton (or scoped, if you truly need isolation) and inject it.

// Program.cs — shape only; align with your SDK version
builder.Services.AddSingleton<OpenAIClient>(_ =>
{
    var endpoint = new Uri(builder.Configuration["AzureOpenAI:Endpoint"]!);
    var credential = new AzureKeyCredential(builder.Configuration["AzureOpenAI:Key"]!);
    return new OpenAIClient(endpoint, credential);
});

Microsoft’s guidance on configuration and key management lives under the Azure OpenAI quickstarts and the work-with-models documentation. In real environments, prefer managed identity and Key Vault over long-lived keys in appsettings.
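As a sketch of the managed-identity variant: swap the key credential for DefaultAzureCredential from Azure.Identity. Constructor overloads vary by SDK version, so verify that your OpenAIClient accepts a TokenCredential.

// Sketch only — assumes an SDK version whose OpenAIClient takes a TokenCredential
using Azure.Identity;
using Azure.AI.OpenAI;

builder.Services.AddSingleton<OpenAIClient>(_ =>
{
    var endpoint = new Uri(builder.Configuration["AzureOpenAI:Endpoint"]!);
    // DefaultAzureCredential resolves managed identity in Azure and
    // developer credentials (az login, Visual Studio) locally.
    return new OpenAIClient(endpoint, new DefaultAzureCredential());
});

No key material touches configuration; access is governed by an RBAC role assignment on the Azure OpenAI resource instead.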

2. Timeouts and retries: be explicit

Model endpoints return 429 (throttling), 5xx, and transient network faults. Combine:

  • HttpClient timeouts aligned with your worst-case latency budget (see IHttpClientFactory).
  • A retry policy with jitter and a max cumulative delay so one bad call cannot exhaust the thread pool.

Polly is the usual choice in .NET; Azure SDK clients often integrate with resilience policies—check your package version’s samples. Retry only idempotent operations (read-only inference is generally safe to retry; anything that charges twice or mutates external state is not—see your billing model).

Example policy shape (illustrative):

// Pseudocode: combine with OpenAIClient’s transport or wrap calls
var jitterer = new Random(); // jitter de-synchronizes retry storms
Policy
    .Handle<HttpRequestException>()
    .Or<TaskCanceledException>()
    .WaitAndRetryAsync(3, attempt =>
        TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt))
        + TimeSpan.FromMilliseconds(jitterer.Next(0, 250)));

Cap total wait (for example, under 8–15 seconds) to match your API’s SLA to upstream callers.
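One way to enforce that cap is Polly's timeout policy wrapped around the retry policy, so retries plus waits can never exceed the budget. A sketch (the 10-second figure is an example; align it with your SLA):

// Sketch: outer timeout bounds the whole retry sequence
// using Polly; using Polly.Timeout;
var retry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt =>
        TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt)));

var overallCap = Policy.TimeoutAsync(
    TimeSpan.FromSeconds(10), TimeoutStrategy.Pessimistic);

// Outermost policy is evaluated first: the cap covers every attempt.
var resilient = overallCap.WrapAsync(retry);

With Pessimistic, the timeout fires even if the wrapped call ignores its cancellation token, which matters for transports that do not cooperate.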

3. Streaming for interactive UX

For chat UIs, streaming token deltas improves perceived latency and lets you abort if the client disconnects. ASP.NET Core can return IAsyncEnumerable<string> or write chunked/SSE responses; the Azure OpenAI chat completions API supports streaming—see streaming chat completions.
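A minimal streaming helper might look like the following. The method and type names (GetChatCompletionsStreamingAsync, StreamingChatCompletionsUpdate, ChatRequestUserMessage) follow the Azure.AI.OpenAI 1.0.0-beta line and may differ in your package version, so treat this as a shape, not a contract:

// Sketch — verify names against your Azure.AI.OpenAI version
using System.Runtime.CompilerServices;
using Azure.AI.OpenAI;

public static async IAsyncEnumerable<string> StreamChat(
    OpenAIClient client, string deploymentName, string prompt,
    [EnumeratorCancellation] CancellationToken ct = default)
{
    var options = new ChatCompletionsOptions(deploymentName,
        new[] { new ChatRequestUserMessage(prompt) });

    await foreach (var update in
        await client.GetChatCompletionsStreamingAsync(options, ct))
    {
        // Each update carries a token delta; forward non-empty chunks.
        if (!string.IsNullOrEmpty(update.ContentUpdate))
            yield return update.ContentUpdate;
    }
}

Returning this IAsyncEnumerable<string> from a minimal-API endpoint gives you chunked transfer for free, and the caller's disconnect flows in through the cancellation token.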

Operational note: with streaming there is no single “final answer” to log until the stream completes. Either buffer the full completion or log deltas incrementally, according to your retention policy (ties to privacy; see customer data and LLMs).

4. Configuration, not literals

Externalize per deployment:

  • Deployment name / model: swapping gpt-4o vs gpt-4o-mini is a cost lever.
  • max_tokens / temperature: directly affect cost and variance.
  • API version: Azure OpenAI is versioned; see API lifecycle.

A single IOptions<AzureOpenAIOptions> bound from configuration keeps code reviewable.
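A sketch of that options type and its binding; the property names here are illustrative (they mirror the settings above), not an SDK contract:

// Sketch: bound from the "AzureOpenAI" configuration section
public sealed class AzureOpenAIOptions
{
    public string Endpoint { get; set; } = "";
    public string DeploymentName { get; set; } = "";
    public int MaxTokens { get; set; } = 512;
    public float Temperature { get; set; } = 0.2f;
    public string ApiVersion { get; set; } = "";
}

// Program.cs
builder.Services.Configure<AzureOpenAIOptions>(
    builder.Configuration.GetSection("AzureOpenAI"));

Consumers take IOptions<AzureOpenAIOptions> in their constructor, so a deployment swap is a config change and a restart, not a code change.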

5. Observability: tokens, latency, outcomes

Log correlation IDs, model/deployment, latency, and prompt/completion tokens when policy allows. Azure Monitor and Application Insights integrate with App Service and containers—see Monitor Azure OpenAI.
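For non-streaming calls, the usage block on the response is the cheapest place to get token counts. A sketch, assuming the Azure.AI.OpenAI beta SDK's Usage property names (PromptTokens, CompletionTokens); adjust for your version:

// Sketch: structured log line per completion
var sw = System.Diagnostics.Stopwatch.StartNew();
var response = await client.GetChatCompletionsAsync(options, ct);
sw.Stop();

logger.LogInformation(
    "chat_completion deployment={Deployment} latency_ms={Latency} " +
    "prompt_tokens={Prompt} completion_tokens={Completion}",
    options.DeploymentName, sw.ElapsedMilliseconds,
    response.Value.Usage.PromptTokens,
    response.Value.Usage.CompletionTokens);

Structured fields (rather than interpolated strings) keep these queryable in Application Insights.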

Use this data to answer: Did last week’s prompt change reduce mean tokens per successful task without hurting success rate? That pairs with the evaluation approach in evaluating Copilot-style features.

6. Rate limits and backoff

Azure OpenAI enforces tokens-per-minute (TPM) and requests-per-minute (RPM) per deployment. When you hit 429, respect Retry-After when present and avoid tight spin loops. Capacity planning is documented under quotas and limits.
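Honoring Retry-After can be as simple as catching the 429 and delaying before one bounded retry. A sketch, assuming RequestFailedException from Azure.Core (its GetRawResponse() exposes the headers in recent SDK versions):

// Sketch: single Retry-After-aware retry; fold into your Polly policy in practice
try
{
    return await client.GetChatCompletionsAsync(options, ct);
}
catch (Azure.RequestFailedException ex) when (ex.Status == 429)
{
    var delay = TimeSpan.FromSeconds(2); // fallback backoff
    var raw = ex.GetRawResponse();
    if (raw is not null &&
        raw.Headers.TryGetValue("Retry-After", out var retryAfter) &&
        int.TryParse(retryAfter, out var seconds))
    {
        delay = TimeSpan.FromSeconds(seconds);
    }
    await Task.Delay(delay, ct);
    return await client.GetChatCompletionsAsync(options, ct); // one retry, then fail
}

In a real service you would fold this into the retry policy from section 2 rather than hand-rolling it per call site.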

Further reading

For workflow automation and cost tradeoffs, see spreadsheet to serverless and LLM cost controls. Questions about your stack: contact.