Azure AI Document Intelligence data extraction in Salesforce

IT Cognity 0 Reputation points
2026-07-03T07:08:19.3566667+00:00

Hello Microsoft Team,

We are using Azure AI Document Intelligence in a Salesforce-based digital onboarding flow for document classification and data extraction.

Our current flow is:

Document upload → Azure Document Intelligence custom classification model → Routing to the corresponding extraction model based on the predicted document type → Field extraction and business validation

The classifier is used to identify different types of identification documents, such as older ID cards, newer ID cards, passports, residence permits, and security force IDs.

To handle cases where users upload documents that are not valid identification documents, we have also added a separate class/category named other.
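
For illustration, the routing step can be sketched roughly as follows. This is a simplified sketch and not our production code: the label names, the model IDs, and the Classification type are hypothetical placeholders, and the call to Azure Document Intelligence itself is left out.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from classifier label to extraction model ID.
# None of these names come from the Azure SDK or from our real project.
EXTRACTION_MODELS = {
    "old_id_card": "extract-old-id",
    "new_id_card": "extract-new-id",
    "passport": "extract-passport",
    "residence_permit": "extract-residence-permit",
    "security_force_id": "extract-security-force-id",
}


@dataclass
class Classification:
    """Stand-in for the classifier output: predicted label and confidence."""
    doc_type: str
    confidence: float


def route(result: Classification) -> Optional[str]:
    """Return the extraction model ID for the predicted type, or None to reject.

    A document is rejected when it is classified as "other" or when the
    label has no configured extraction model.
    """
    if result.doc_type == "other":
        return None
    return EXTRACTION_MODELS.get(result.doc_type)
```

In this sketch a file is rejected only when the classifier itself predicts other, so any irrelevant file that is predicted as an ID type goes straight to an extraction model.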

For the other class, we have used training samples that represent non-ID documents we expect users may upload by mistake, such as utility bills, property-related documents, and similar business documents. For these expected non-ID documents, the classifier generally works correctly and classifies them as other.

However, our issue is with completely irrelevant or unexpected files/images. For example, random images, screenshots, arbitrary documents, or files that are clearly not identification documents are sometimes classified as one of the ID document classes instead of other.

In some of these cases, the classification confidence returned by Azure is also high. We have observed cases where the confidence score for an irrelevant file classified as an ID type is higher than the confidence score for a valid ID document classified correctly. Because of this, using only a confidence threshold is risky for our use case, since it could either fail to reject irrelevant files or incorrectly reject valid ID documents.
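
To make the threshold problem concrete, here is a minimal sketch. The scores below are made-up illustrative values, not measurements from our system; they only mimic the pattern we observe, where some irrelevant files score higher than some valid IDs.

```python
from typing import List, Tuple

# Illustrative (invented) confidence scores, not real measurements.
VALID_ID_SCORES = [0.82, 0.91, 0.88]      # valid IDs, classified correctly
IRRELEVANT_SCORES = [0.95, 0.60, 0.93]    # irrelevant files, classified as an ID type


def errors_at(threshold: float,
              valid: List[float],
              irrelevant: List[float]) -> Tuple[int, int]:
    """Count (false_rejects, false_accepts) when accepting scores >= threshold."""
    false_rejects = sum(score < threshold for score in valid)
    false_accepts = sum(score >= threshold for score in irrelevant)
    return false_rejects, false_accepts


def is_separable(valid: List[float], irrelevant: List[float]) -> bool:
    """True only if some threshold accepts every valid ID and rejects every irrelevant file."""
    return min(valid) > max(irrelevant)
```

With scores that overlap like this, is_separable returns False: raising the threshold rejects valid IDs, and lowering it accepts irrelevant files, so no single cut-off removes both kinds of error.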

We would like Microsoft’s official guidance on the expected behavior and recommended design for this scenario.

Specifically:

  1. Would Microsoft recommend a different architecture for this use case?
  2. If the other class is expected to work for this scenario, what type of training data should be included in it?
  3. Is there a recommended maximum or practical limit for how broad the other class should be before it becomes ineffective or negatively affects classification between valid ID document classes?
  4. Since the confidence score is not consistently lower for irrelevant files, what is Microsoft’s recommended approach for rejecting documents that do not belong to any valid ID category?
  5. More generally, what is the recommended way to handle out-of-scope / out-of-distribution documents in Azure Document Intelligence custom classification, especially when the possible invalid inputs are very broad and cannot be fully represented in training data?

Our goal is to understand whether the current behavior is expected product behavior, whether a broad other class can reliably solve this problem, and what Microsoft recommends as the best-practice approach for rejecting irrelevant files without incorrectly rejecting valid ID documents.

Content Safety in Foundry Control Plane


