AI Engineer

Paperless AI System

Three-Layer AI Document Management with Multi-OCR & Semantic Search

3,000+ Documents Auto-Organized
90%+ OCR Accuracy
<1s Semantic Search
0 Manual Tagging

Overview

Built an AI system on top of paperless-ngx that combines intelligent OCR, auto-classification, and semantic search. The system has processed 3,000+ documents automatically, with zero manual tagging.

Three integrated services work together: paperless-gpt (Go) handles multi-provider OCR through a worker pool and LLM-powered auto-tagging, paperless-chroma (Python) maintains a ChromaDB vector database for semantic search, and paperless-ngx provides core document storage and the UI.

The Challenge

Ten years of documents with useless search: OCR couldn't handle poor scans, and search had no semantic understanding.

The Solution

Three-layer AI system: multi-provider OCR (GPT-4 Vision, Google Document AI, Ollama), LLM auto-tagging, ChromaDB vector search.

Technical Implementation

Intelligent OCR Layer

  • Multi-provider OCR routing (GPT-4 Vision, Google Document AI, Ollama)
  • Worker pool with 4 concurrent processors
  • 90%+ accuracy on poor-quality scans
  • Automatic provider selection based on document type
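The routing and worker-pool bullets above can be illustrated with a minimal sketch. It is written in Python for brevity, although the actual paperless-gpt service is written in Go. The document types, the routing table, and the function names (`select_provider`, `run_ocr`, `process_batch`) are assumptions made for illustration; only the provider names and the pool size of 4 come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical routing table: document type -> OCR provider.
# The provider names come from the list above; the mapping is an assumption.
PROVIDER_BY_TYPE = {
    "handwritten": "gpt-4-vision",
    "form": "google-document-ai",
}
DEFAULT_PROVIDER = "ollama"


def select_provider(doc_type: str) -> str:
    """Pick an OCR provider for a document type, falling back to a default."""
    return PROVIDER_BY_TYPE.get(doc_type, DEFAULT_PROVIDER)


def run_ocr(doc: dict) -> dict:
    """Placeholder for a real provider call; records which provider was chosen."""
    provider = select_provider(doc["type"])
    # A real implementation would send the page images to the provider here.
    return {"id": doc["id"], "provider": provider}


def process_batch(docs: list[dict], workers: int = 4) -> list[dict]:
    """Process documents with a fixed-size pool of concurrent workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even though work runs concurrently.
        return list(pool.map(run_ocr, docs))
```

Capping the pool at a fixed size bounds the number of simultaneous provider requests, which matters when the providers are rate-limited or billed per call.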

Auto-Classification

  • LLM-powered title generation
  • Automatic tag assignment
  • Correspondent identification
  • Document date extraction
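One plausible way to turn an LLM reply into the four fields listed above is to ask the model for JSON and validate it before writing anything back to paperless-ngx. The sketch below assumes that approach; the JSON keys, the `DocumentMetadata` class, and both function names are hypothetical and not taken from the project.

```python
import json
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DocumentMetadata:
    title: str
    tags: list[str] = field(default_factory=list)
    correspondent: Optional[str] = None
    document_date: Optional[date] = None


def build_prompt(ocr_text: str, known_tags: list[str]) -> str:
    """Ask for a strict JSON reply so the result can be parsed mechanically."""
    return (
        "Return JSON with keys title, tags, correspondent, date (YYYY-MM-DD).\n"
        f"Choose tags only from: {', '.join(known_tags)}\n\n"
        f"Document text:\n{ocr_text}"
    )


def parse_llm_reply(reply: str, known_tags: list[str]) -> DocumentMetadata:
    """Validate the model's JSON reply and drop anything that does not check out."""
    data = json.loads(reply)
    allowed = set(known_tags)
    # Keep only tags that already exist, so the model cannot invent new ones.
    tags = [t for t in data.get("tags", []) if t in allowed]
    try:
        raw_date = data.get("date")
        doc_date = date.fromisoformat(raw_date) if raw_date else None
    except (TypeError, ValueError):
        doc_date = None  # An unparseable date is treated as missing.
    return DocumentMetadata(
        title=str(data.get("title", "")).strip(),
        tags=tags,
        correspondent=data.get("correspondent") or None,
        document_date=doc_date,
    )
```

Restricting tags to a known list and discarding malformed dates keeps a bad model reply from polluting the archive's metadata.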

Semantic Search

  • ChromaDB vector database
  • BAAI/bge-base-en-v1.5 embeddings
  • Document chunking (1000 chars, 200 overlap)
  • Concept-based search ("car accident" finds insurance claims)
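The chunking parameters above (1000 characters with a 200-character overlap) can be sketched as a simple sliding window. This is a minimal character-based version under the assumption that chunks are cut at fixed offsets; the real paperless-chroma chunker may split on sentence or paragraph boundaries instead, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the
    last `overlap` characters of the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap  # Each new chunk starts this many characters later.
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # The last chunk already reaches the end of the text.
    return chunks
```

Each chunk would then be embedded (here with BAAI/bge-base-en-v1.5) and stored in ChromaDB. The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, provided it is shorter than the overlap.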

Tech Stack

Backend: Go, Python, Flask
Frontend: React
AI/ML: GPT-4 Vision, Google Document AI, Ollama, Sentence Transformers
Database: ChromaDB, PostgreSQL
Infrastructure: Docker, Docker Compose

Skills Demonstrated

Go, Python, GPT-4 Vision, Google Document AI, Ollama, ChromaDB, Sentence Transformers, Docker, LLM Integration
