Inconsistent Latency with Cohere Embed V4 Model on Azure AI Foundry

Question

Hari 25

The Cohere Embed V4 model deployed via Azure AI Foundry shows inconsistent performance. Embedding very short inputs (e.g., a two-word phrase) sometimes returns within 1–2 seconds but occasionally takes 80–100 seconds. No changes have been made to input format or request frequency. This latency issue is severely affecting user experience.

Answer 1

Christoph Zeiner 30

I’m experiencing the exact same issue and have already opened multiple support tickets, but it remains unresolved and keeps recurring. I strongly urge Microsoft to address this latency problem. Unlike LLMs, embedding models cannot be easily swapped out, which makes the impact more severe. The latency is present across all regions, including global deployments. Even Microsoft’s own sample code, which embeds just a few words, takes around 30 seconds per request, making the service practically unusable.

@Jerald Felix

Answer 2

Hello Hari,

Sorry to hear you’re experiencing inconsistent latency with the Cohere Embed V4 model on Azure AI Foundry. Latency spikes on very short phrases are frustrating, particularly when your inputs and request patterns haven’t changed.

Here are some steps and considerations based on similar cases and best practices for Azure-hosted AI models:

1.	Check for Azure Service Load or Outages

•	Sometimes, spikes in latency are due to backend service throttling or temporary platform congestion. Review your Azure region’s service health in the Azure Portal to rule out known incidents that might affect compute availability.

2.	Monitor & Separate Workloads

•	Mixing different kinds of workloads on a single endpoint can cause batching delays, where small/quick requests get queued behind longer ones. If you can, create a dedicated deployment for your embedding workload.

3.	Evaluate API Throttling or Quotas

•	Even if you don’t hit official quotas, bursts of activity or many parallel requests can cause cold starts or increased queuing. Try spreading requests more evenly over time and check the Azure Portal’s monitoring/metrics for any “throttling” events.

4.	Content Filtering and Model Configuration

•	Azure includes content filtering on deployed models. In some cases, the content filter step adds appreciable latency on input or output, even for simple text. If your use case is low risk, you might ask about, or test with, adjusted filtering settings to reduce the added latency.

5.	Short Input Edge Cases

•	Some embedding models or hosting platforms may internally batch or pad very short inputs. It’s rare, but worth testing with slightly longer (3–4+ word) phrases to see if results differ consistently.

6.	Cold Start Delays

•	Serverless deployments (like Azure AI Foundry’s default) can sometimes incur start-up delays if the underlying resources have been scaled down due to inactivity. Try sending periodic “keep-alive” requests, or move to a dedicated (non-serverless) plan if ultra-low latency is critical.

7.	Check Model/SDK Versions

•	Make sure the SDKs (client-side) and deployed model versions match current Azure recommendations. Occasionally, mismatches or outdated SDKs can affect request parsing or response handling, causing unexpected slowdowns.
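
To look for the patterns behind steps 3 and 5 above, a small probe along these lines might help. It is only a sketch: `embed` is a hypothetical callable that you would write yourself to wrap the actual request to your deployment, and the `repeats` and `pause_s` values are arbitrary starting points, not recommendations.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def probe_latency(
    embed: Callable[[str], object],
    inputs: Sequence[str],
    repeats: int = 5,
    pause_s: float = 1.0,
) -> Dict[str, List[float]]:
    """Time repeated calls to `embed` for each input, spaced `pause_s` apart."""
    timings: Dict[str, List[float]] = {text: [] for text in inputs}
    for _ in range(repeats):
        for text in inputs:
            start = time.perf_counter()
            embed(text)  # hypothetical wrapper around the real endpoint call
            timings[text].append(time.perf_counter() - start)
            time.sleep(pause_s)  # spread requests evenly instead of bursting
    return timings


def summarize(timings: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Return the median and worst-case latency in seconds for each input."""
    return {
        text: {"median_s": statistics.median(vals), "max_s": max(vals)}
        for text, vals in timings.items()
    }
```

Calling `summarize(probe_latency(my_embed, ["two words", "a slightly longer test phrase"]))` at different times of day would show whether the slow calls track input length or simply cluster in time.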

If you’ve tried these steps and are still seeing 80–100 second delays, it’s best to open a detailed support ticket with Azure, including timestamps, request IDs, and endpoint region. Sometimes, only direct backend logs can reveal the true source (such as internal node issues or unexpected queuing).
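
For the ticket details mentioned above (timestamps, request IDs, endpoint region), one JSON line per slow request is usually enough. The sketch below assumes you can read the HTTP response headers from your client. The header names in `CANDIDATE_ID_HEADERS` are an assumption based on names commonly seen on Azure services, so check which one your endpoint actually returns.

```python
import json
from datetime import datetime, timezone
from typing import Mapping, Optional

# Assumption: common request-ID header names; verify against your own responses.
CANDIDATE_ID_HEADERS = ("x-request-id", "x-ms-request-id", "apim-request-id")


def find_request_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return the first matching request-ID header value, case-insensitively."""
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in CANDIDATE_ID_HEADERS:
        if name in lowered:
            return lowered[name]
    return None


def ticket_record(
    started_at: datetime,
    latency_s: float,
    region: str,
    headers: Mapping[str, str],
) -> str:
    """Serialize one request as a JSON line suitable for a support ticket.

    `started_at` should be timezone-aware; it is converted to UTC.
    """
    return json.dumps({
        "timestamp_utc": started_at.astimezone(timezone.utc).isoformat(),
        "latency_s": round(latency_s, 3),
        "region": region,
        "request_id": find_request_id(headers),
    })
```

Appending each record to a log file and attaching the slowest entries gives support concrete requests to trace on the backend.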

Let me know if you’d like further help on any of these steps, or if you discover new patterns in when the latency occurs!

Best regards,

Jerald Felix

Rohith Krishnakumar 0 Reputation points

2026-06-27T15:47:11.9833333+00:00

Hey @Jerald Felix, I've been seeing the same issue intermittently. In my case, Cohere Embed v4 is used as the vectorizer for my Azure AI Search index. Vector search queries work fine most of the time, but sporadically we hit about half a day of downtime where the latency is very noticeable, and we have to fall back to simple search to keep the index searchable. This doesn't happen while vectorizing my document corpus; it happens when vectorizing the input query (often just two keywords) and fetching data from the index. Is there a definite resolution step, or an analysis I can perform, to identify the root cause of this issue?
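
Until a root cause is found, the manual fallback to simple search described above could be automated with a timeout. This is a rough sketch only: `vector_search` and `simple_search` are hypothetical callables standing in for your own query code, and the 5-second default is an arbitrary placeholder.

```python
import concurrent.futures
from typing import Callable, List


def search_with_fallback(
    vector_search: Callable[[str], List[str]],
    simple_search: Callable[[str], List[str]],
    query: str,
    timeout_s: float = 5.0,
) -> List[str]:
    """Run `vector_search`; if it exceeds `timeout_s`, use `simple_search`."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(vector_search, query)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The vector query is too slow right now; degrade to keyword search.
        return simple_search(query)
    finally:
        # Do not block on the slow call; its worker thread finishes on its own.
        executor.shutdown(wait=False)
```

Errors raised by `vector_search` itself still propagate to the caller. A production version would likely reuse one executor rather than creating one per query, and would record each fallback so the slow periods can be correlated with timestamps later.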