Paperless AI System
Three-Layer AI Document Management with Multi-OCR & Semantic Search
Overview
Built a comprehensive AI system on top of paperless-ngx combining intelligent OCR, auto-classification, and semantic search. The system processes 3,000+ documents automatically with zero manual tagging.
Three integrated services work together: paperless-gpt (Go) handles multi-provider OCR with a worker pool and LLM-powered auto-tagging, paperless-chroma (Python) provides a ChromaDB vector database for semantic search, and paperless-ngx supplies core document storage and the UI.
The Challenge
Ten years of documents with effectively useless search: OCR couldn't handle poor-quality scans, and search had no semantic understanding.
The Solution
A three-layer AI system: multi-provider OCR (GPT-4 Vision, Google Document AI, Ollama), LLM auto-tagging, and ChromaDB vector search.
Technical Implementation
Intelligent OCR Layer
- Multi-provider OCR routing (GPT-4 Vision, Google Document AI, Ollama)
- Worker pool with 4 concurrent processors
- 90%+ accuracy on poor-quality scans
- Automatic provider selection based on document type
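The routing and worker-pool ideas above can be sketched roughly as follows. This is an illustration in Python rather than the Go code that paperless-gpt actually uses, and the document-type names, the routing table, and the functions `select_provider`, `run_ocr`, and `process_batch` are all hypothetical; only the provider names and the pool size of 4 come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical routing table: the document types and their assignments are
# illustrative assumptions, not the project's real selection rules.
PROVIDER_BY_TYPE = {
    "handwritten": "gpt-4-vision",
    "form": "google-document-ai",
}
DEFAULT_PROVIDER = "ollama"


def select_provider(doc_type: str) -> str:
    """Pick an OCR provider for a document type, falling back to a default."""
    return PROVIDER_BY_TYPE.get(doc_type, DEFAULT_PROVIDER)


def run_ocr(doc: dict) -> dict:
    """Placeholder OCR step: records which provider would handle the document.

    A real worker would call the chosen provider's API here.
    """
    return {"id": doc["id"], "provider": select_provider(doc["type"])}


def process_batch(docs: list[dict], workers: int = 4) -> list[dict]:
    """Process documents with a fixed-size pool of concurrent workers.

    pool.map returns results in the same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_ocr, docs))
```

A fixed pool size caps the number of simultaneous provider calls, which keeps a large backlog from overwhelming rate-limited OCR APIs.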
Auto-Classification
- LLM-powered title generation
- Automatic tag assignment
- Correspondent identification
- Document date extraction
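One way the four outputs above could be collected is to ask the LLM for a single JSON object and validate it before writing anything back to paperless-ngx. The sketch below assumes a hypothetical response schema (`title`, `tags`, `correspondent`, `created`) and a hypothetical helper `parse_classification`; the project's actual prompt and response format are not described here.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass
class Classification:
    title: str
    tags: list[str]
    correspondent: str
    created: date


def parse_classification(raw: str, allowed_tags: set[str]) -> Classification:
    """Parse an LLM's JSON reply (assumed schema) into validated metadata.

    Tags not in allowed_tags are dropped so the model cannot invent new ones.
    Raises KeyError or ValueError if required fields are missing or malformed.
    """
    data = json.loads(raw)
    tags = [t for t in data.get("tags", []) if t in allowed_tags]
    return Classification(
        title=data["title"].strip(),
        tags=tags,
        correspondent=data["correspondent"].strip(),
        created=date.fromisoformat(data["created"]),  # expects YYYY-MM-DD
    )
```

Filtering tags against an existing list is a design choice in this sketch: it keeps the tag vocabulary stable at the cost of never creating new tags automatically.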
Semantic Search
- ChromaDB vector database
- BAAI/bge-base-en-v1.5 embeddings
- Document chunking (1,000-character chunks with a 200-character overlap)
- Concept-based search ("car accident" finds insurance claims)
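The chunking step can be illustrated with a minimal sliding-window splitter. The function name `chunk_text` and its exact boundary handling are assumptions made for illustration; only the 1000-character size and 200-character overlap come from the list above, and the real service may split on sentence or token boundaries instead.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap their neighbors.

    Each chunk starts (size - overlap) characters after the previous one, so
    the last `overlap` characters of one chunk open the next chunk.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks
```

Each chunk would then be embedded and stored with its source document ID. The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps a query such as "car accident" match the relevant passage of an insurance claim.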