Developer Tools

AWS IDP Accelerator auto-generates schemas from unlabeled document collections

The new multi-document discovery feature analyzes thousands of unknown documents, clusters them by type, and outputs a ready-to-use schema for each type.

Deep Dive

AWS has introduced multi-document discovery for its IDP Accelerator, an open-source, serverless solution for intelligent document processing (IDP). The feature embeds each unknown document with a visual embedding model (Cohere Embed v4 via Amazon Bedrock), clusters the embeddings, and then generates a schema for each cluster using Strands Agents and a Bedrock LLM. Clustering uses k-means: it tests k values from 2 to 20 and keeps the grouping with the highest silhouette score, which identifies document types without manual labeling. The resulting schemas integrate directly into the IDP Accelerator configuration file.
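The k selection described above can be illustrated with a minimal sketch. This is not the IDP Accelerator's code: the function names (`kmeans`, `silhouette`, `choose_k`) are hypothetical, toy 2-D points stand in for the Cohere Embed v4 vectors, and a deterministic farthest-first initialization replaces whatever seeding the real implementation uses. It relies only on the Python standard library; a production version would more likely use scikit-learn's `KMeans` and `silhouette_score`.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def init_centers(points, k):
    """Farthest-first seeding (an assumption made here for determinism)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_centers == centers:
            break  # converged: centers did not move
        centers = new_centers
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (higher is better)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(dist(p, points[j]) for j in idx) / len(idx)
            for lab, idx in clusters.items()
            if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, k_min=2, k_max=20):
    """Try each k in [k_min, k_max] and keep the best silhouette score."""
    best = None
    # The silhouette score needs at least 2 and at most n - 1 clusters.
    for k in range(k_min, min(k_max, len(points) - 1) + 1):
        labels = kmeans(points, k)
        score = silhouette(points, labels)
        if best is None or score > best[1]:
            best = (k, score, labels)
    return best


# Toy stand-in for document embeddings: three tight, well-separated blobs.
offsets = [(0.0, 0.0), (0.5, 0.0), (-0.5, 0.0), (0.0, 0.5), (0.0, -0.5)]
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
demo_points = [(cx + dx, cy + dy) for cx, cy in blob_centers for dx, dy in offsets]

best_k, best_score, best_labels = choose_k(demo_points)
print(f"selected k={best_k}, silhouette={best_score:.3f}")
```

On this toy data the search settles on three clusters, one per blob. With real embeddings the score curve is usually flatter, so the selected k is best treated as a starting point for review.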

Key Points
  • Uses Cohere Embed v4 via Amazon Bedrock for visual embeddings that capture layout and formatting.
  • Auto-clusters documents with silhouette-score-optimized k-means, testing k from 2 to 20.
  • Generates schemas with Strands Agents and a Bedrock LLM, with a reflection step to catch overlaps.

Why It Matters

Automating schema creation removes a manual step from enterprise IDP projects, so teams can deploy faster on large, unlabeled document collections.