Mixed sources approach in Content Understanding

Question

Arvid 20

Hello,

I have been using Document Intelligence for a while but have now encountered a situation where I believe Content Understanding might be better. I have a process that reads data from claims in bankruptcies. I currently use DI to read PDFs, and I have now been asked whether we could extract the same data from email files (.msg and maybe .eml).

I have two problems:

1. Even though the documentation specifies that .msg files are accepted, they are not accepted as training documents ("Content Understanding now supports the following file formats on this page: .jpg, .jpeg, .jpe, .jif, .jfi, .jfif, .png, .tif, .tiff, .pdf, .bmp, .heic, .heif, .txt").
2. The emails vary a lot: all the data may be in the body, in a screengrab included in the body, in both body and attachments, or in an attached .doc combined with attached .pdfs. What is the best approach here? Ideally I'd send in the .msg, and CU would analyze both body and attachments and decide what it needs to fill the fields. But how do I train it for this if I can't upload .msg files to the training view?

Many thanks in advance!

Arvid

2 answers

Answer 1

Arvid 20

But the documentation clearly states that .msg files are supported:
https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/service-limits

Answer 2

Content Understanding and Document Intelligence currently share the same core input constraints, namely those documented for the Document Intelligence models. Those constraints are important for planning a mixed-source approach.

About using .msg/.eml as training documents

The documented, supported formats for training and analysis with the Document Intelligence models are:

- PDF
- Images: JPEG/JPG, PNG, BMP, TIFF, HEIF
- Office: Word (DOCX), Excel (XLSX), PowerPoint (PPTX), HTML
- Plain text (.txt) in some Content Understanding experiences

Email container formats such as .msg or .eml are not listed as supported training inputs in the Document Intelligence input requirements. For custom models (custom extraction, custom classification, custom neural) and prebuilt models, only the formats listed in those input requirements are supported. This matches what the Content Understanding training view shows: .msg is not accepted as a training file.

Given these constraints, .msg/.eml cannot be used directly as training documents in the Content Understanding training UI. Instead, the email content must be converted or extracted into one of the supported formats before training or inference.
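
As a rough illustration of that check, the sketch below flags files that would need conversion before being sent to the service. The extension set is an assumption derived from the format list above (including the `.tif` spelling), and `needs_conversion` is a hypothetical helper name, not part of any SDK; verify the set against the current service limits before relying on it.

```python
from pathlib import Path

# Assumption: extensions derived from the format list in this answer.
# Check the current service documentation before relying on this set.
SUPPORTED_EXTENSIONS = {
    ".pdf",
    ".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff", ".heif",
    ".docx", ".xlsx", ".pptx", ".html",
    ".txt",
}


def needs_conversion(filename: str) -> bool:
    """Return True if the file must be converted or unpacked first."""
    return Path(filename).suffix.lower() not in SUPPORTED_EXTENSIONS
```

With this set, `needs_conversion("claim.msg")` is true, while a file named `Claim.PDF` passes through unchanged because the comparison is case-insensitive.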

Strategy for mixed sources (emails + attachments + PDFs)

Because the service operates on supported document types, the recommended pattern is to normalize all inputs into those types and then let the model handle the variability in where the data appears.

A practical approach:

Preprocess emails outside Content Understanding
- Parse .msg/.eml using application logic (for example, a mail-processing component in the pipeline) to extract:
  - Email body (as HTML or plain text → saved as .html or .txt)
  - Attachments (PDF, images, Office docs)
- For image-only content (e.g., screenshots embedded in the email body), render or export them as supported image formats (JPEG/PNG/TIFF/BMP/HEIF) or as a PDF.
Train on normalized documents
- Use the normalized outputs (PDF, images, DOCX/XLSX/PPTX, HTML, TXT) as training data for Content Understanding / Document Intelligence custom models.
- Ensure training data follows the documented input requirements:
  - Supported formats only.
  - Respect limits on file size, page count, image dimensions, and text height.
  - For custom models, stay within page and total-size limits (for example, up to 500 pages for custom template training data, 50,000 pages for custom neural, and the documented GB limits for classification).
Design the pipeline to handle variability
- For each incoming email:
  1. Extract body and attachments.
  2. Convert each part to a supported format if necessary.
  3. Decide which parts to send to the model:
    - If the body contains structured text, send the body (as HTML/TXT/PDF) to the model.
    - If key data is in attachments (PDFs, images, Office docs), send those directly.
    - If data may be split between body and attachments, either:
      - Concatenate content into a single normalized document (for example, a combined PDF or HTML), or
      - Run the model separately on each part and then merge extracted fields in application logic.
Use custom models suited to the variability
- For highly variable layouts (different email templates, screenshots, mixed attachments), a custom neural or custom classification + custom extraction approach is typically more robust than a strict template model, as long as training data stays within the documented limits.
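
Steps 1 and 2 of the per-email pipeline above could look roughly like the following for .eml files, using only Python's standard `email` package. This is a minimal sketch, not a complete mail processor: `split_email` and the `body.html` / `body.txt` names are hypothetical, and .msg is an Outlook-specific format that the standard library does not parse, so it would need a separate parser that is not shown here.

```python
from email import policy
from email.parser import BytesParser


def split_email(raw: bytes) -> list[tuple[str, bytes]]:
    """Split a raw .eml message into (filename, content) parts.

    The body is returned first, as body.html or body.txt, followed by
    every other leaf part (attachments and inline images).
    """
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    parts: list[tuple[str, bytes]] = []

    # Prefer the HTML body and fall back to plain text.
    body = msg.get_body(preferencelist=("html", "plain"))
    if body is not None:
        ext = ".html" if body.get_content_type() == "text/html" else ".txt"
        parts.append(("body" + ext, body.get_content().encode("utf-8")))

    for index, part in enumerate(msg.walk()):
        if part.is_multipart() or part is body:
            continue
        # Skip the unnamed alternative rendering of the body
        # (for example the plain-text twin of an HTML body).
        if part.get_content_maintype() == "text" and part.get_filename() is None:
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        # Hypothetical fallback name for parts without a filename.
        name = part.get_filename() or f"part-{index}.bin"
        parts.append((name, payload))
    return parts
```

Walking the whole MIME tree, rather than only top-level attachments, is a deliberate choice here so that screenshots embedded inline in the body are also returned and can be sent to the model as images.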

This pattern lets Content Understanding/Document Intelligence operate on supported formats while the surrounding application handles .msg/.eml parsing and the decision logic about which parts of the email to use.
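
For the "run the model separately on each part and merge" option, the merge step in application logic might be sketched as below. The result shape (a dict mapping field name to a dict with `value` and `confidence` keys) is an assumption made for illustration and is not the service's actual response schema; `merge_fields` is a hypothetical helper.

```python
def merge_fields(results: list[dict]) -> dict:
    """Merge per-part extraction results into one set of fields.

    For each field, keep the non-empty candidate with the highest
    confidence. Assumed input shape (hypothetical):
    {"fieldName": {"value": ..., "confidence": float}}
    """
    merged: dict = {}
    for fields in results:
        for name, candidate in fields.items():
            # Ignore parts where the field was not found.
            if candidate.get("value") in (None, ""):
                continue
            current = merged.get(name)
            if current is None or (
                candidate.get("confidence", 0.0) > current.get("confidence", 0.0)
            ):
                merged[name] = candidate
    return merged
```

Picking the highest-confidence candidate is only one possible policy; for claims data it may be safer to flag conflicting values between body and attachments for manual review instead of resolving them silently.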
