AI Engineer

Paperless AI System

Three-Layer AI Document Management with Multi-OCR & Semantic Search

3,000+

Documents Auto-Organized

90%+

OCR Accuracy

<1s

Semantic Search

Zero

Manual Tagging

Overview

Built a full AI system on top of paperless-ngx combining multi-provider OCR, auto-classification, and semantic search. The system processes 3,000+ documents automatically with zero manual tagging.

Three integrated services work together: paperless-gpt (Go) handles multi-provider OCR with a worker pool and LLM-powered auto-tagging, paperless-chroma (Python) maintains a ChromaDB vector database for semantic search, and paperless-ngx provides core document storage and the UI.

The Challenge

Ten years of documents with effectively useless search: the existing OCR couldn't handle poor scans, and search had no semantic understanding.

The Solution

A three-layer AI system: multi-provider OCR (GPT-4 Vision, Google Document AI, Ollama), LLM auto-tagging, and ChromaDB vector search.

Technical Implementation

Intelligent OCR Layer

Multi-provider OCR routing (GPT-4 Vision, Google Document AI, Ollama)
Worker pool with 4 concurrent processors
90%+ accuracy on poor-quality scans
Automatic provider selection based on document type
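
The routing and worker-pool ideas above can be sketched roughly as follows. This is a Python illustration only (the actual paperless-gpt service is written in Go), and the routing table, the document-type labels, and the `run_ocr` placeholder are all hypothetical; the real selection rules are not described here.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical routing table (assumption): document type -> OCR provider.
ROUTES = {
    "handwritten": "gpt4-vision",
    "form": "google-document-ai",
}
DEFAULT_PROVIDER = "ollama"  # assumed fallback for everything else


def select_provider(doc_type: str) -> str:
    """Pick an OCR provider for a document type, falling back to the default."""
    return ROUTES.get(doc_type, DEFAULT_PROVIDER)


def run_ocr(doc: dict) -> dict:
    """Placeholder OCR step: a real version would call the chosen provider's API."""
    provider = select_provider(doc["type"])
    return {"id": doc["id"], "provider": provider, "text": f"<text from {provider}>"}


def process_batch(docs: list, workers: int = 4) -> list:
    """Run OCR over documents with a fixed pool of concurrent workers.

    pool.map returns results in the same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_ocr, docs))
```

Calling `process_batch(docs)` with a list of `{"id": ..., "type": ...}` dictionaries keeps at most four OCR calls in flight at a time, which matches the pool size listed above.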

Auto-Classification

LLM-powered title generation
Automatic tag assignment
Correspondent identification
Document date extraction
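
One way to turn an LLM reply into the four fields listed above is to ask the model for JSON and validate it before writing anything back to paperless-ngx. The sketch below assumes a JSON shape with `title`, `tags`, `correspondent`, and `document_date` keys; that shape and the `Classification` type are illustrative, not the project's actual schema.

```python
import json
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Classification:
    title: str
    tags: List[str] = field(default_factory=list)
    correspondent: Optional[str] = None
    document_date: Optional[date] = None


def parse_classification(raw: str) -> Classification:
    """Parse and normalize a JSON reply (assumed shape) from the LLM."""
    data = json.loads(raw)

    # Dates are expected as ISO strings (YYYY-MM-DD); anything else is dropped.
    raw_date = data.get("document_date")
    try:
        parsed_date = date.fromisoformat(raw_date) if raw_date else None
    except (TypeError, ValueError):
        parsed_date = None

    # Lowercase, trim, and de-duplicate tags so "Insurance" and "insurance " merge.
    tags = sorted(
        {t.strip().lower() for t in data.get("tags", []) if isinstance(t, str) and t.strip()}
    )

    return Classification(
        title=data["title"].strip(),
        tags=tags,
        correspondent=data.get("correspondent") or None,
        document_date=parsed_date,
    )
```

Normalizing the reply in one place means a malformed date or a duplicated tag from the model degrades to a missing field rather than a bad record.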

Semantic Search

ChromaDB vector database
BAAI/bge-base-en-v1.5 embeddings
Document chunking (1000 chars, 200 overlap)
Concept-based search ("car accident" finds insurance claims)
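
The chunking step (1000 characters with a 200-character overlap) can be illustrated with a simple sliding window. This is a minimal sketch that assumes plain character-based splitting; the function name `chunk_text` is hypothetical, and the real service may split on sentence or paragraph boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # each window starts 800 chars after the last
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk would then be embedded (here, with BAAI/bge-base-en-v1.5) and stored in ChromaDB. The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.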

Tech Stack

Backend

Go, Python, Flask

Frontend

React

AI/ML

GPT-4 Vision, Google Document AI, Ollama, Sentence Transformers

Database

ChromaDB, PostgreSQL

Infrastructure

Docker, Docker Compose

Skills Demonstrated

Go, Python, GPT-4 Vision, Google Document AI, Ollama, ChromaDB, Sentence Transformers, Docker, LLM Integration
