AWS IDP Accelerator auto-generates schemas from unlabeled document collections
Analyzes thousands of unknown documents, clusters them by type, and automatically outputs ready-to-use schemas.
AWS has introduced multi-document discovery for its IDP Accelerator, an open-source, serverless solution for intelligent document processing (IDP). The feature computes visual embeddings for unknown documents (Cohere Embed v4 via Amazon Bedrock), clusters them, and uses agents to generate schemas for the resulting document types. Clustering uses k-means: the feature tests k values from 2 to 20 and keeps the grouping with the highest silhouette score, so document types are identified without manual labeling. The resulting schemas integrate directly into the IDP Accelerator configuration file.
- Uses Cohere Embed v4 via Amazon Bedrock for visual embeddings that capture layout and formatting.
- Auto-clusters documents with silhouette-score-optimized k-means, testing k from 2 to 20.
- Generates schemas with Strands Agents and a Bedrock LLM, with a reflection step to catch overlaps.
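The k-selection step described above can be illustrated with a minimal sketch. It is not the IDP Accelerator's code: it uses only the Python standard library, toy 2-D points stand in for document embeddings, and the helper names (`kmeans`, `silhouette`, `pick_k`) are hypothetical. The farthest-point initialization is a deterministic simplification chosen for the example; a real pipeline would more likely rely on a library such as scikit-learn.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from all centroids chosen so far.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        # a: mean distance to the rest of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

def pick_k(points, k_min=2, k_max=20):
    """Try each k in the range and keep the grouping with the highest silhouette score."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(k_min, min(k_max, len(points) - 1) + 1):
        labels = kmeans(points, k)
        score = silhouette(points, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_score, best_labels

# Toy data: three tight groups standing in for three document types.
offsets = [(-0.2, 0.0), (0.2, 0.0), (0.0, -0.2), (0.0, 0.2), (0.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in centers for dx, dy in offsets]

best_k, score, labels = pick_k(points)
print(best_k, round(score, 3))
```

On this toy data the search settles on three clusters, because splitting any tight group further lowers the silhouette score of its members. With real embeddings the groups are rarely this clean, so the selected k is best treated as a starting point for review.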
Why It Matters
The feature removes manual schema creation from enterprise IDP setup, enabling rapid deployment on massive, unlabeled document collections.