Inconsistent Latency with Cohere Embed V4 Model on Azure AI Foundry

Hari 25 Reputation points
2025-07-30T06:58:17.41+00:00

The Cohere Embed V4 model deployed via Azure AI Foundry shows inconsistent performance. Embedding very short inputs (e.g., a two-word phrase) sometimes returns within 1–2 seconds but occasionally takes 80–100 seconds. No changes have been made to input format or request frequency. This latency issue is severely affecting user experience.

Foundry Tools

Formerly known as Azure AI Services or Azure Cognitive Services, Foundry Tools is a unified collection of prebuilt AI capabilities within the Microsoft Foundry platform.


2 answers

  1. Christoph Zeiner 30 Reputation points
    2025-09-13T18:06:56.57+00:00

    I’m experiencing the exact same issue and have already opened multiple support tickets, but it remains unresolved and keeps recurring. I strongly urge Microsoft to address this latency problem. Unlike LLMs, embedding models cannot be easily swapped out, which makes the impact more severe. The latency is present across all regions, including global deployments. Even when running Microsoft’s own sample code (embedding just a few words), the request takes around 30 seconds, which makes the service practically unusable.

    @Jerald Felix


  2. Jerald Felix 15,690 Reputation points Volunteer Moderator
    2025-07-31T05:02:23.8866667+00:00

    Hello Hari,

    Sorry to hear you’re experiencing inconsistent latency with the Cohere Embed V4 model on Azure AI Foundry. Latency spikes on very short phrases are frustrating, particularly when your inputs and request patterns haven’t changed.

    Here are some steps and considerations based on similar cases and best practices for Azure-hosted AI models:

    1.	Check for Azure Service Load or Outages
    
    •	Sometimes, spikes in latency are due to backend service throttling or temporary platform congestion. Review your Azure region’s service health in the Azure Portal to rule out known incidents that might affect compute availability.
    
    2.	Monitor & Separate Workloads
    
    •	Mixing different kinds of workloads on a single endpoint can cause batching delays, where small/quick requests get queued behind longer ones. If you can, create a dedicated deployment for your embedding workload.
    
    3.	Evaluate API Throttling or Quotas
    
    •	Even if you stay under the official quotas, bursts of activity or many parallel requests can increase queuing. Try spreading requests more evenly over time, and check the Azure Portal’s monitoring metrics for any throttling events.
    
    4.	Content Filtering and Model Configuration
    
    •	Azure includes content filtering on deployed models. In some cases, the content filter step adds appreciable latency upon input/output, even for simple text. If your use case is low risk, you might inquire about or test with adjusted filtering settings to reduce added latency.
    
    5.	Short Input Edge Cases
    
    •	Some embedding models or hosting platforms may internally batch or pad very short inputs. It’s rare, but worth testing with slightly longer (3–4+ word) phrases to see if results differ consistently.
    
    6.	Cold Start Delays
    
    •	Serverless deployments (like Azure AI Foundry’s default) can sometimes incur start-up delays if the underlying resources have been scaled down due to inactivity. Try sending periodic “keep-alive” requests, or move to a dedicated (non-serverless) plan if consistently low latency is critical.
    
    7.	Check Model/SDK Versions
    
    •	Make sure the SDKs (client-side) and deployed model versions match current Azure recommendations. Occasionally, mismatches or outdated SDKs can affect request parsing or response handling, causing unexpected slowdowns.
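
To check steps 3 and 5 from the client side, something like the following may help. It is only a minimal sketch, not an official tool: `embed` is a hypothetical callable that wraps whatever embedding call you already make, and the names `measure_latency`, `summarize`, and the one-second default interval are my own assumptions. It sends one request per input at an even pace and records each latency, so you can compare two-word inputs against slightly longer ones.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def measure_latency(
    embed: Callable[[str], object],      # hypothetical: wraps your existing embedding call
    texts: Sequence[str],
    interval_s: float = 1.0,             # assumed pause target between request starts
    clock: Callable[[], float] = time.perf_counter,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, object]]:
    """Send one request per text, evenly paced, and record each latency."""
    results: List[Dict[str, object]] = []
    for i, text in enumerate(texts):
        start = clock()
        embed(text)
        elapsed = clock() - start
        results.append(
            {"text": text, "words": len(text.split()), "latency_s": elapsed}
        )
        # Sleep only for the rest of the interval, and not after the last request.
        if i < len(texts) - 1 and elapsed < interval_s:
            sleep(interval_s - elapsed)
    return results


def summarize(results: List[Dict[str, object]]) -> Dict[str, float]:
    """Summarize recorded latencies; expects at least one result."""
    latencies = [r["latency_s"] for r in results]
    return {
        "count": len(latencies),
        "median_s": statistics.median(latencies),
        "max_s": max(latencies),
    }
```

Running this once with short phrases and once with longer ones, at the same pace, should show whether the spikes track input length or request bursts, or neither.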
    

    If you’ve tried these steps and are still seeing 80–100 second delays, it’s best to open a detailed support ticket with Azure, including timestamps, request IDs, and endpoint region. Sometimes, only direct backend logs can reveal the true source (such as internal node issues or unexpected queuing).
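
To collect the timestamps and request IDs for such a ticket, a small wrapper along these lines could work. Treat it as a sketch under assumptions: `send` is a hypothetical callable around your own client code that takes the text plus a client-generated ID and returns the server-side request ID if your response exposes one (how it is exposed depends on the service, so that part is left to you). The `timed_call` name and the 10-second slow threshold are also assumptions.

```python
import time
import uuid
from datetime import datetime, timezone
from typing import Callable, Dict, Optional


def timed_call(
    send: Callable[[str, str], Optional[str]],  # hypothetical wrapper around your client
    text: str,
    slow_threshold_s: float = 10.0,             # assumed cutoff for "slow"
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, object]:
    """Run one request and return a record suitable for a support ticket."""
    client_request_id = str(uuid.uuid4())
    started_at_utc = datetime.now(timezone.utc).isoformat()
    server_request_id = None
    error = None
    start = clock()
    try:
        server_request_id = send(text, client_request_id)
    except Exception as exc:  # keep the failure in the record instead of losing it
        error = repr(exc)
    latency_s = clock() - start
    return {
        "started_at_utc": started_at_utc,
        "client_request_id": client_request_id,
        "server_request_id": server_request_id,
        "input_chars": len(text),
        "latency_s": latency_s,
        "slow": latency_s >= slow_threshold_s,
        "error": error,
    }
```

Appending each record to a log file and filtering on `slow` gives a list of concrete timestamps and IDs to attach, along with the endpoint region.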

    Let me know if you’d like further help on any of these steps, or if you discover new patterns in when the latency occurs!

    Best regards,

    Jerald Felix
