Engram

Documentation

Introduction

Engram is a local-first AI operating system. Inference, embedding, vector storage, and scheduled agents all run inside Docker on your own hardware. No data leaves your machine during core operation.

Overview

Engram follows a headless host pattern: a Dockerized FastAPI kernel (core/brain.py) acts as the single authoritative backend. The Streamlit dashboard, CLI tools, and IDE extensions are all thin clients that speak to it over HTTP. The inference engine is Ollama — running on your own hardware, never proxied through any Engram-controlled endpoint.

Vector memory is stored in two Qdrant collections. Personal memories (second_brain) are Fernet-encrypted at the application layer before being written to Qdrant — compensating for the absence of at-rest encryption in the open-source Qdrant build. Documentation knowledge (doc_knowledge) is unencrypted for fast RAG retrieval; it contains only public content ingested explicitly by the user.

Scheduled agents (calendar sync, email triage) run via AsyncIOScheduler inside the FastAPI process. There is no separate worker container, no broker, and no result backend. Third-party integrations (Google Calendar, Gmail, Linear, Jira) are opt-in and communicate directly with their respective APIs; Engram is not a proxy and does not relay credentials externally.


Design Decisions

These are the non-obvious architectural choices and the reasoning behind them.

Why local-first, not hybrid?

Cloud AI creates an unavoidable egress path at the infrastructure level, regardless of the provider's data retention policy. For regulated industries, the egress event itself is the compliance risk — not what happens to data after it arrives. A hybrid architecture that routes sensitive context through a cloud LLM for "performance" reintroduces the exact risk the deployment model is meant to eliminate.

Engram eliminates egress at the architecture level. There is no fallback inference endpoint, no telemetry SDK, and no analytics call home. Once Ollama model weights are downloaded, the core chat and RAG pipeline has zero network requirements.

Why Qdrant?

Qdrant provides persistent on-disk HNSW indexes with cosine similarity search, a Docker-native deployment with a healthcheck-compatible HTTP API, and — critically — no managed cloud tier that could accidentally become a data egress path. Both collections use the same 768-dimensional embedding space (nomic-embed-text:latest), which means retrieval scores from second_brain and doc_knowledge are directly comparable — enabling the unified POST /api/search/unified endpoint.

Collection separation

second_brain holds personal memories (encrypted). doc_knowledge holds externally ingested documentation (unencrypted). They are queried separately and merged at the application layer, not at the Qdrant level.
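Because both collections share the same 768-dimensional cosine space, their raw scores can be merged without rescaling. A minimal sketch of this application-layer merge; the `merge_hits` helper is illustrative, not Engram's actual code:

```python
def merge_hits(personal_hits: list[dict], doc_hits: list[dict],
               top_k: int = 5) -> list[dict]:
    """Merge results from second_brain and doc_knowledge by raw cosine score.

    Scores are directly comparable because both collections use the same
    768-dim nomic-embed-text embedding space with cosine distance.
    """
    tagged = (
        [{**h, "source": "second_brain"} for h in personal_hits]
        + [{**h, "source": "doc_knowledge"} for h in doc_hits]
    )
    # Single ranked list, highest similarity first.
    return sorted(tagged, key=lambda h: h["score"], reverse=True)[:top_k]
```

Merging in application code (rather than in Qdrant) is what lets encrypted and plaintext collections keep independent storage policies while still serving one unified result list.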

Why APScheduler over Celery?

For a single-user local deployment, Celery requires a Redis broker, a worker container, a beat container, and optionally a result backend — four additional processes for jobs that run every 15 to 60 minutes. AsyncIOScheduler from APScheduler runs in the FastAPI process, starts on the lifespan startup event, and needs zero additional infrastructure.

Single-process tradeoff

If os_layer crashes, APScheduler stops with it. Scheduled agents will not run until the container restarts. This is acceptable for local-first single-user deployments but would be inadequate for multi-tenant or high-availability requirements.

Why application-layer encryption?

The open-source Qdrant build does not provide at-rest encryption for vector payloads. EncryptedMemoryClient wraps QdrantClient and applies Fernet (AES-128-CBC + HMAC-SHA256) before every write and after every read. Payload fields listed in PLAINTEXT_KEYS — specifically user_id, type, and classification — remain unencrypted so Qdrant can evaluate filter conditions without decrypting the full payload.
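The wrapping pattern can be sketched with the cryptography library. The PLAINTEXT_KEYS values come from the text above; the function names are illustrative, not EncryptedMemoryClient's actual internals:

```python
from cryptography.fernet import Fernet

# Filterable fields stay plaintext so Qdrant can evaluate filter conditions.
PLAINTEXT_KEYS = {"user_id", "type", "classification"}


def encrypt_payload(payload: dict, f: Fernet) -> dict:
    """Fernet-encrypt every field except those Qdrant must filter on."""
    return {
        k: v if k in PLAINTEXT_KEYS else f.encrypt(str(v).encode()).decode()
        for k, v in payload.items()
    }


def decrypt_payload(payload: dict, f: Fernet) -> dict:
    """Inverse of encrypt_payload, applied to payloads read back from Qdrant."""
    return {
        k: v if k in PLAINTEXT_KEYS else f.decrypt(v.encode()).decode()
        for k, v in payload.items()
    }
```

Qdrant only ever sees ciphertext for the sensitive fields; the vector itself remains usable for similarity search because only the payload, not the embedding, is encrypted.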

ENGRAM_ENCRYPTION_KEY must be shared across containers

All containers that read or write second_brain (currently os_layer) must use an identical key. Divergent keys produce silent decryption failures — records will read as garbage rather than throwing an explicit error. Set ENGRAM_ENCRYPTION_KEY once in .env and let Docker Compose inject it uniformly.
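A sketch of the uniform injection, assuming a standard Docker Compose layout; only the os_layer service name comes from the text above, the rest of the fragment is illustrative:

```yaml
# docker-compose.yml (fragment) — the key is defined once in .env and
# injected identically into every container that touches second_brain.
services:
  os_layer:
    environment:
      ENGRAM_ENCRYPTION_KEY: ${ENGRAM_ENCRYPTION_KEY}
```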

Request Lifecycle

What happens, in order, when a message reaches POST /chat.

  1. POST /chat: FastAPI validates the UserInput schema (text, optional user_id, optional matter_id).

  2. classify(): Input is classified via keyword + embedding heuristics into a Classification IntEnum (GENERAL through CLASSIFIED). Higher classifications may trigger content sanitization before embedding.

  3. get_identity(): Resolves the stable user UUID from ~/.engram/identity.json, auto-generated on first run and persisted. The ENGRAM_USER_ID env var overrides it for multi-container consistency.

  4. POST /api/embeddings: A 768-dim float vector is generated by Ollama (nomic-embed-text:latest), called directly over HTTP with no LangChain and no SDK.

  5. EncryptedMemoryClient.search(): Qdrant cosine similarity query on second_brain, filtered by user_id and optionally matter_id. Returns the top-K payload objects, each Fernet-decrypted before use.

  6. RAG assembly: Decrypted context memories are assembled into a structured prompt string and injected into the system message of the chat completion request.

  7. POST /api/chat: Ollama streaming completion (llama3.1:latest). The full conversation history plus RAG context is sent, and the response is streamed back to the client.

  8. EncryptedMemoryClient.write(): The exchange is stored as a new vector point in second_brain, with the payload Fernet-encrypted before upsert. The point ID is a SHA-256 hash of user_id + content hash.

  9. audit_writer: An audit log entry is written via Unix socket to the audit_writer service. Each entry is HMAC-SHA256 chained to the previous entry, so any tampering breaks the chain.
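The tamper-evidence property in the final step can be sketched with the standard library's hmac module. This is an illustrative model of hash chaining, not audit_writer's actual wire format:

```python
import hashlib
import hmac
import json


def append_entry(log: list[dict], entry: dict, key: bytes) -> list[dict]:
    """Append an audit entry whose MAC covers the previous entry's MAC."""
    prev_mac = log[-1]["mac"] if log else ""
    body = json.dumps(entry, sort_keys=True)
    mac = hmac.new(key, (prev_mac + body).encode(), hashlib.sha256).hexdigest()
    return log + [{"entry": entry, "mac": mac}]


def verify_chain(log: list[dict], key: bytes) -> bool:
    """Recompute every MAC in order; an edited entry breaks the chain."""
    prev_mac = ""
    for record in log:
        body = json.dumps(record["entry"], sort_keys=True)
        expected = hmac.new(key, (prev_mac + body).encode(),
                            hashlib.sha256).hexdigest()
        if not hmac.compare_digest(expected, record["mac"]):
            return False
        prev_mac = record["mac"]
    return True
```

Because each MAC keys on its predecessor, modifying, reordering, or deleting any historical entry invalidates every MAC after it, which is what makes the log tamper-evident rather than merely append-only.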


System Requirements

| Component | Minimum | Recommended | Notes |
| --- | --- | --- | --- |
| RAM | 8 GB | 16 GB | Qdrant ≈2 GB · API + Scheduler ≈2 GB · Dashboard ≈512 MB · Ollama model weights vary |
| Storage | 20 GB SSD | 40 GB SSD | llama3.1:8b ≈ 4.7 GB · nomic-embed-text ≈ 270 MB · vector index grows with usage |
| CPU | 4 cores | 8+ cores | CPU-only inference is functional but slow (~10–30 s/response). GPU acceleration strongly recommended. |
| GPU | None required | Apple MPS / NVIDIA CUDA | Metal acceleration on M1+ is automatic via Ollama. CUDA requires nvidia-container-toolkit. |
| OS | macOS 14+ / Ubuntu 22.04 / Win 10+ (WSL2) | | Docker Desktop required on macOS and Windows. Native Docker Engine on Linux. |

Apple Silicon

Ollama uses Metal Performance Shaders (MPS) on M1/M2/M3 automatically — no configuration needed. Inference speed on Apple Silicon is typically 10–20× faster than CPU-only mode.

Quick Start

Clone the repository, run the one-time setup script (generates secrets, installs deps), then start the full stack. See the Installation page for platform-specific notes and Google OAuth setup.

```bash
git clone https://github.com/engram-os/engram-os.git
cd engram-os
chmod +x scripts/setup.sh && ./scripts/setup.sh
./scripts/start.sh
```

After startup: localhost:8000 (API) · localhost:8501 (Dashboard) · APScheduler agents start automatically in-process.