Hello Hari,
Sorry to hear you’re seeing inconsistent latency with the Cohere Embed V4 model on Azure AI Foundry. Latency spikes on very short phrases are frustrating, particularly when your inputs and request patterns haven’t changed.
Here are some steps and considerations based on similar cases and best practices for Azure-hosted AI models:
1. Check for Azure Service Load or Outages
• Sometimes, spikes in latency are due to backend service throttling or temporary platform congestion. Review your Azure region’s service health in the Azure Portal to rule out known incidents that might affect compute availability.
2. Monitor & Separate Workloads
• Mixing different kinds of workloads on a single endpoint can cause batching delays, where small/quick requests get queued behind longer ones. If you can, create a dedicated deployment for your embedding workload.
3. Evaluate API Throttling or Quotas
• Even if you stay under your official quotas, bursts of activity or many parallel requests can increase queuing or force new instances to spin up. Try spreading requests more evenly over time, and check your deployment’s metrics in the Azure Portal for throttling events (for example, HTTP 429 responses).
4. Content Filtering and Model Configuration
• Azure applies content filtering to deployed models. In some cases, the filtering step adds noticeable latency when it screens the input or output, even for simple text. If your use case is low risk, you could ask about, or test with, adjusted filtering settings to reduce the added latency.
5. Short Input Edge Cases
• Some embedding models or hosting platforms may internally batch or pad very short inputs. It’s rare, but worth testing with slightly longer phrases (3–4 words or more) to see whether latency differs consistently.
6. Cold Start Delays
• Serverless deployments (like Azure AI Foundry’s default) can incur start-up delays when the underlying resources have been scaled down after a period of inactivity. Try sending periodic “keep-alive” requests, or move to a dedicated (non-serverless) plan if consistently low latency is critical.
7. Check Model/SDK Versions
• Make sure the SDKs (client-side) and deployed model versions match current Azure recommendations. Occasionally, mismatches or outdated SDKs can affect request parsing or response handling, causing unexpected slowdowns.
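To check points 3 and 5 yourself, a small timing harness may help. The sketch below is only an illustration: `fake_embed` is a hypothetical stand-in for your real call to the deployment, and `time_calls`, `summarize`, and `pause_s` are names made up for this example, not part of any Azure or Cohere SDK.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def time_calls(embed: Callable[[str], object], texts: Sequence[str],
               pause_s: float = 0.0) -> List[float]:
    """Call embed(text) once per text and return each call's latency in seconds."""
    latencies = []
    for text in texts:
        start = time.perf_counter()
        embed(text)
        latencies.append(time.perf_counter() - start)
        if pause_s > 0:
            time.sleep(pause_s)  # spread requests out instead of bursting
    return latencies


def summarize(latencies: Sequence[float]) -> Dict[str, float]:
    """Return the median and worst-case latency for one group of calls."""
    return {"median_s": statistics.median(latencies), "max_s": max(latencies)}


def fake_embed(text: str) -> list:
    # Hypothetical stand-in: replace the body with your real embedding request.
    return [0.0] * 4


if __name__ == "__main__":
    groups = {
        "short": ["hello", "invoice", "refund"],
        "longer": ["hello there my friend", "where is my invoice",
                   "please refund my order"],
    }
    for label, texts in groups.items():
        print(label, summarize(time_calls(fake_embed, texts, pause_s=0.1)))
```

If the "short" group is consistently slower than the "longer" group, that points toward the short-input edge case; if both groups show occasional spikes, queuing or cold starts are more likely.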
If you’ve tried these steps and are still seeing 80–100 second delays, it’s best to open a detailed support ticket with Azure, including timestamps, request IDs, and endpoint region. Sometimes, only direct backend logs can reveal the true source (such as internal node issues or unexpected queuing).
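To gather the timestamps, request IDs, and region for that ticket, you could log one record per request. This is a rough sketch under assumptions: the header names in `REQUEST_ID_HEADERS` are guesses at where a request ID might appear (check what your endpoint actually returns), and `make_record`, `to_csv`, and the 5-second `threshold_s` are hypothetical choices for illustration.

```python
import csv
import io
from datetime import datetime, timezone
from typing import Dict, Iterable, Mapping, Optional

# Assumed header names; verify against the real response headers you receive.
REQUEST_ID_HEADERS = ("x-request-id", "x-ms-request-id", "apim-request-id")
FIELDS = ["timestamp_utc", "request_id", "region", "latency_s", "slow"]


def find_request_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return the first request ID found, matching header names case-insensitively."""
    lowered = {key.lower(): value for key, value in headers.items()}
    for name in REQUEST_ID_HEADERS:
        if name in lowered:
            return lowered[name]
    return None


def make_record(headers: Mapping[str, str], latency_s: float, region: str,
                threshold_s: float = 5.0) -> Dict[str, object]:
    """Build one log row; 'slow' flags calls at or above threshold_s."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "request_id": find_request_id(headers) or "",
        "region": region,
        "latency_s": round(latency_s, 3),
        "slow": latency_s >= threshold_s,
    }


def to_csv(records: Iterable[Dict[str, object]]) -> str:
    """Render the records as CSV text that can be attached to a ticket."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

Attaching only the rows where `slow` is true, along with a few normal rows for comparison, usually gives support enough to correlate with their backend logs.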
Let me know if you’d like further help on any of these steps, or if you discover new patterns in when the latency occurs!
Best regards,
Jerald Felix