Azure AI Document Intelligence data extraction in Salesforce

Question

IT Cognity 0

Hello Microsoft Team,

We are using Azure AI Document Intelligence in a Salesforce-based digital onboarding flow for document classification and data extraction.

Our current flow is:

Document upload → Azure Document Intelligence custom classification model → Routing to the corresponding extraction model based on the predicted document type → Field extraction and business validation

The classifier is used to identify different types of identification documents, such as older ID cards, newer ID cards, passports, residence permits, security force IDs, etc.

To handle cases where users upload documents that are not valid identification documents, we have also added a separate class/category named other.

For the other class, we have used training samples that represent non-ID documents we expect users may upload by mistake, such as utility bills, property-related documents, and similar business documents. For these expected non-ID documents, the classifier generally works correctly and classifies them as other.

However, our issue is with completely irrelevant or unexpected files/images. For example, random images, screenshots, arbitrary documents, or files that are clearly not identification documents are sometimes classified as one of the ID document classes instead of other.

In some of these cases, the classification confidence returned by Azure is also high. We have observed cases where the confidence score for an irrelevant file classified as an ID type is higher than the confidence score for a valid ID document classified correctly. Because of this, using only a confidence threshold is risky for our use case, since it could either fail to reject irrelevant files or incorrectly reject valid ID documents.

We would like Microsoft’s official guidance on the expected behavior and recommended design for this scenario.

Specifically:

Would Microsoft recommend a different architecture for this use case?
If the other class is expected to work for this scenario, what type of training data should be included in it?
Is there a recommended maximum or practical limit for how broad the other class should be before it becomes ineffective or negatively affects classification between valid ID document classes?
Since the confidence score is not consistently lower for irrelevant files, what is Microsoft’s recommended approach for rejecting documents that do not belong to any valid ID category?
More generally, what is the recommended way to handle out-of-scope / out-of-distribution documents in Azure Document Intelligence custom classification, especially when the possible invalid inputs are very broad and cannot be fully represented in training data?

Our goal is to understand whether the current behavior is expected product behavior, whether a broad other class can reliably solve this problem, and what Microsoft recommends as the best-practice approach for rejecting irrelevant files without incorrectly rejecting valid ID documents.

Thanmayi Godithi 10,820 Reputation points Microsoft External Staff Moderator

2026-07-03T08:34:57.21+00:00
Hi IT Cognity,

Thank you for reaching out on the Microsoft Q&A forum.

Azure AI Document Intelligence is the extraction engine — it returns structured JSON (fields + confidence) via an asynchronous REST API/SDK. There's no built-in Salesforce connector, so the pattern is: call Document Intelligence, then have an integration layer write the results into Salesforce.

The Document Intelligence call (prebuilt or your custom model):

POST https://<resource>.cognitiveservices.azure.com/documentintelligence/documentModels/<modelId>:analyze?api-version=2024-11-30
Ocp-Apim-Subscription-Key: <key>

{ "urlSource": "<blob-SAS-url>" }    # or base64Source / bytes

→ 202 Accepted; read the Operation-Location header

GET <Operation-Location>
→ poll until status "succeeded"
→ analyzeResult.documents[].fields (value + confidence)

It's async by design: whatever calls it must poll Operation-Location until the operation completes. There is no synchronous call that returns the fields immediately.
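The polling step can be sketched roughly as follows. This is an illustration under stated assumptions, not official sample code: `poll_until_done` and its `fetch` parameter are hypothetical names, where `fetch` stands for whatever performs the GET on the Operation-Location URL and returns the parsed JSON body. The sketch treats "succeeded" and "failed" as terminal statuses and keeps waiting on anything else.

```python
import time
from typing import Callable


def poll_until_done(
    fetch: Callable[[], dict],
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() repeatedly until the operation reaches a terminal status.

    fetch is a hypothetical callable that GETs the Operation-Location URL
    and returns the decoded JSON body as a dict.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        body = fetch()
        status = body.get("status")
        if status == "succeeded":
            # The extracted fields live under analyzeResult.documents[].fields.
            return body["analyzeResult"]
        if status == "failed":
            raise RuntimeError(f"Analysis failed: {body.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still '{status}' after {timeout_s} seconds")
        sleep(interval_s)
```

Passing `fetch` in as a callable keeps the loop independent of the HTTP client, so the same function could sit inside an Azure Function or a local test harness.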

Recommended Azure-native orchestration:

Logic Apps — the built-in Document Intelligence action handles the polling for you, then the Salesforce connector creates/updates the record. Lowest-code, cleanest.

Azure Functions — call DI, poll, shape the JSON, then POST to the Salesforce REST API with an integration user's OAuth token (store secrets in Key Vault, use managed identity for the DI call).

If the write-to-Salesforce must stay on the Salesforce side, expose an Azure Function/APIM endpoint that runs DI submit+poll internally and returns finished fields for their Apex/MuleSoft to consume.
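For the "shape the JSON" step in these options, a minimal sketch might look like the following. It assumes the `analyzeResult` layout described above (`documents[].fields`, with each field carrying a `content` string). `to_salesforce_record`, `FIELD_MAP`, and every field name in the mapping are hypothetical and would need to match your own extraction model and Salesforce object.

```python
from typing import Dict, Mapping

# Hypothetical mapping: extraction-model field name -> Salesforce field API name.
FIELD_MAP: Dict[str, str] = {
    "FirstName": "First_Name__c",
    "DocumentNumber": "Document_Number__c",
}


def to_salesforce_record(
    analyze_result: Mapping, field_map: Mapping[str, str] = FIELD_MAP
) -> Dict[str, str]:
    """Build a flat record dict from the first analyzed document."""
    documents = analyze_result.get("documents") or []
    if not documents:
        return {}
    fields = documents[0].get("fields", {})
    record: Dict[str, str] = {}
    for di_name, sf_name in field_map.items():
        field = fields.get(di_name)
        # Skip fields the model did not return or returned without text.
        if field is not None and field.get("content") is not None:
            record[sf_name] = field["content"]
    return record
```

The resulting dict would then be serialized as the JSON body of the create or update request sent to the Salesforce REST API with the integration user's OAuth bearer token.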

A few Azure-side tips: send documents as bytes or as a Blob SAS URL (Document Intelligence must be able to reach them); use the per-field confidence scores to route low-confidence extractions to human review before writing to Salesforce; and prefer Microsoft Entra ID authentication with a managed identity over API keys in production.
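The confidence-based routing mentioned above could be sketched like this. It is only an illustration: `split_by_confidence` is a hypothetical helper, the 0.8 threshold is an arbitrary placeholder to be tuned on your own data, and the input is assumed to be one document's `fields` dict in which each field has `content` and `confidence` keys.

```python
from typing import Dict, Mapping, Optional, Tuple


def split_by_confidence(
    fields: Mapping[str, Mapping], threshold: float = 0.8
) -> Tuple[Dict[str, Optional[str]], Dict[str, Optional[str]]]:
    """Split extracted fields into (accepted, needs_review) by confidence."""
    accepted: Dict[str, Optional[str]] = {}
    needs_review: Dict[str, Optional[str]] = {}
    for name, field in fields.items():
        confidence = field.get("confidence")
        # A missing confidence is treated as low and sent to review.
        if confidence is not None and confidence >= threshold:
            accepted[name] = field.get("content")
        else:
            needs_review[name] = field.get("content")
    return accepted, needs_review
```

If `needs_review` is non-empty, the integration layer could hold the record for a human check instead of writing it straight through.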

Kindly let us know if the above helps or you need further assistance on this issue.
