Calling Azure OpenAI from a .NET API: patterns that hold up in production
Client lifecycle, HttpClient + Polly, streaming, configuration, and observability for Azure OpenAI behind ASP.NET Core—with examples and official references.
Integrating Azure OpenAI Service into a .NET API is easy in a demo; production is about timeouts, retries, streaming, quotas, and operability. Below is a condensed pattern set I use when the goal is predictable behavior under load—not a proof of concept.
1. One client, registered in DI
Constructing OpenAIClient (or the Azure-specific client from Azure.AI.OpenAI) per request wastes sockets and makes timeouts inconsistent. Register one singleton (or scoped, if you truly need isolation) and inject it.
```csharp
// Program.cs — shape only; align with your SDK version
builder.Services.AddSingleton<OpenAIClient>(_ =>
{
    var endpoint = new Uri(builder.Configuration["AzureOpenAI:Endpoint"]!);
    var credential = new AzureKeyCredential(builder.Configuration["AzureOpenAI:Key"]!);
    return new OpenAIClient(endpoint, credential);
});
```
Microsoft’s guidance on configuration and key management lives in the Azure OpenAI quickstarts and the “work with models” documentation. Prefer managed identity and Key Vault in real environments over long-lived keys in appsettings.
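If your endpoint supports Microsoft Entra authentication, the keyless path is a small change — a sketch, assuming the Azure.Identity package (constructor shapes vary by SDK version):

```csharp
// DefaultAzureCredential resolves managed identity in Azure and
// developer credentials (az login, Visual Studio) locally — no key material.
builder.Services.AddSingleton<OpenAIClient>(_ =>
{
    var endpoint = new Uri(builder.Configuration["AzureOpenAI:Endpoint"]!);
    return new OpenAIClient(endpoint, new DefaultAzureCredential());
});
```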
2. Timeouts and retries: be explicit
Model endpoints return 429 (throttling), 5xx, and transient network faults. Combine:
- HttpClient timeouts aligned with your worst-case latency budget (see IHttpClientFactory).
- A retry policy with jitter and a max cumulative delay so one bad call cannot exhaust the thread pool.
Polly is the usual choice in .NET; Azure SDK clients often integrate with resilience policies—check your package version’s samples. Retry only idempotent operations (read-only inference is generally safe to retry; anything that charges twice or mutates external state is not—see your billing model).
Example policy shape (illustrative):
```csharp
// Exponential backoff with jitter; wrap calls with this policy or plug it
// into the client transport, depending on your package version.
var jitterer = new Random();
var retryPolicy = Policy
    .Handle<HttpRequestException>()
    .Or<TaskCanceledException>()
    .WaitAndRetryAsync(3, attempt =>
        TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt))
        + TimeSpan.FromMilliseconds(jitterer.Next(0, 100)));
```
Cap total wait (for example, under 8–15 seconds) to match your API’s SLA to upstream callers.
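One way to enforce that cumulative cap is an outer Polly timeout wrapping the retry policy — a sketch with illustrative numbers:

```csharp
// The outer timeout bounds waits plus attempts to one total budget,
// so retries cannot stretch a single upstream call past your SLA.
var retry = Policy
    .Handle<HttpRequestException>()
    .WaitAndRetryAsync(3, attempt =>
        TimeSpan.FromMilliseconds(200 * Math.Pow(2, attempt)));
var withBudget = Policy.TimeoutAsync(TimeSpan.FromSeconds(10)).WrapAsync(retry);
```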
3. Streaming for interactive UX
For chat UIs, streaming token deltas improves perceived latency and lets you abort if the client disconnects. ASP.NET Core can return IAsyncEnumerable<string> or write chunked/SSE responses; the Azure OpenAI chat completions API supports streaming—see streaming chat completions.
Operational note: with streaming there is no “final answer” to log until the stream completes—buffer the deltas or log incrementally, according to your retention policy (ties to privacy—see customer data and LLMs).
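A minimal endpoint shape for token streaming might look like this — a sketch only: the deployment name is a placeholder, and the streaming method and option types differ across Azure.AI.OpenAI versions:

```csharp
app.MapGet("/chat", (OpenAIClient client, string prompt, CancellationToken ct) =>
    StreamAnswer(client, prompt, ct));

// ASP.NET Core writes IAsyncEnumerable<string> as a chunked response;
// client disconnects flow through ct and abort the upstream stream.
static async IAsyncEnumerable<string> StreamAnswer(
    OpenAIClient client, string prompt,
    [EnumeratorCancellation] CancellationToken ct)
{
    var options = new ChatCompletionsOptions(
        "my-deployment", new[] { new ChatRequestUserMessage(prompt) });
    await foreach (var update in await client.GetChatCompletionsStreamingAsync(options, ct))
    {
        if (!string.IsNullOrEmpty(update.ContentUpdate))
            yield return update.ContentUpdate;
    }
}
```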
4. Configuration, not literals
Externalize per deployment:
| Setting | Why it matters |
|---|---|
| Deployment name / model | Swapping gpt-4o vs gpt-4o-mini is a cost lever |
| max_tokens / temperature | Directly affects cost and variance |
| API version | Azure OpenAI is versioned—see API lifecycle |
A single IOptions<AzureOpenAIOptions> bound from configuration keeps code reviewable.
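A sketch of such an options type (the names are illustrative, not an SDK contract):

```csharp
public sealed class AzureOpenAIOptions
{
    public string Deployment { get; init; } = "";
    public string ApiVersion { get; init; } = "";
    public int MaxTokens { get; init; } = 512;
    public float Temperature { get; init; } = 0.2f;
}

// Program.cs — bind once; consumers take IOptions<AzureOpenAIOptions>.
builder.Services.Configure<AzureOpenAIOptions>(
    builder.Configuration.GetSection("AzureOpenAI"));
```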
5. Observability: tokens, latency, outcomes
Log correlation IDs, model/deployment, latency, and prompt/completion tokens when policy allows. Azure Monitor and Application Insights integrate with App Service and containers—see Monitor Azure OpenAI.
Use this data to answer: Did last week’s prompt change reduce mean tokens per successful task without hurting success rate? That pairs with the evaluation approach in evaluating Copilot-style features.
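The per-call log can stay small — a sketch, assuming the SDK response exposes usage counts (property names vary by version):

```csharp
// Structured fields keep the log queryable in Application Insights.
var sw = Stopwatch.StartNew();
var response = await client.GetChatCompletionsAsync(options, ct);
sw.Stop();
logger.LogInformation(
    "chat deployment={Deployment} latencyMs={LatencyMs} promptTokens={PromptTokens} completionTokens={CompletionTokens}",
    deploymentName, sw.ElapsedMilliseconds,
    response.Value.Usage.PromptTokens, response.Value.Usage.CompletionTokens);
```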
6. Rate limits and backoff
Azure OpenAI enforces tokens-per-minute (TPM) and requests-per-minute (RPM) per deployment. When you hit 429, respect Retry-After when present and avoid tight spin loops. Capacity planning is documented under quotas and limits.
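Honoring Retry-After can be folded into the Polly policy — a sketch against Azure.Core’s RequestFailedException (header parsing simplified; Retry-After may also arrive as an HTTP date):

```csharp
var throttlePolicy = Policy
    .Handle<RequestFailedException>(ex => ex.Status == 429)
    .WaitAndRetryAsync(
        retryCount: 3,
        sleepDurationProvider: (attempt, ex, _) =>
        {
            // Prefer the server's hint; fall back to exponential backoff.
            var response = ((RequestFailedException)ex).GetRawResponse();
            if (response is not null
                && response.Headers.TryGetValue("Retry-After", out var value)
                && int.TryParse(value, out var seconds))
                return TimeSpan.FromSeconds(seconds);
            return TimeSpan.FromSeconds(Math.Pow(2, attempt));
        },
        onRetryAsync: (_, _, _, _) => Task.CompletedTask);
```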
Further reading
- Azure OpenAI .NET quickstart
- OpenAI .NET library (GitHub) — API surface for chat, streaming, and structured outputs
- ASP.NET Core performance best practices — thread pool and async pitfalls
For workflow automation and cost tradeoffs, see spreadsheet to serverless and LLM cost controls. Questions about your stack: contact.