CocoIndex

Open-source incremental data-transformation engine for AI agents. Turns codebases, meeting notes, inboxes, Slack, PDFs, and videos into live, continuously fresh context for LLM apps, recomputing only the delta when sources change. Python declarative API on top of a Rust engine; Apache 2.0.

Tagline: “Your agents deserve fresh context.”

🧩 Core idea

  • Incremental: only the Δ. When a source changes, CocoIndex identifies the affected records, propagates the change across joins/lookups, updates the target, and retires stale rows without touching anything else
  • Declarative: describe what should be in the target; the engine keeps it in sync as sources change
  • Cached by content hash: @coco.fn(memo=True) keys the cache on hash(input) + hash(code), so a change to the logic invalidates the affected entries
  • Parallel by default: Rust core, zero-copy transforms where possible, failure isolation per record
  • Explainable lineage: sources → flows → targets, with traceable provenance
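
To make the hash(input) + hash(code) bullet concrete, here is a minimal pure-Python sketch of the idea. It does not use CocoIndex and is not how the engine is implemented; MemoCache, code_fingerprint, and input_fingerprint are hypothetical names, and hashing the function's bytecode is only an approximation of "hash(code)".

```python
import hashlib
import json

def code_fingerprint(fn):
    # Assumption: bytecode + constants + referenced names stand in for "hash(code)".
    code = fn.__code__
    payload = code.co_code + repr((code.co_consts, code.co_names)).encode()
    return hashlib.sha256(payload).hexdigest()

def input_fingerprint(value):
    # Assumption: inputs are JSON-serializable.
    return hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()

class MemoCache:
    """Hypothetical in-memory cache; a real engine would persist its entries."""

    def __init__(self):
        self.store = {}
        self.misses = 0

    def call(self, fn, value):
        key = (code_fingerprint(fn), input_fingerprint(value))
        if key not in self.store:
            self.misses += 1            # recompute only on a new (code, input) pair
            self.store[key] = fn(value)
        return self.store[key]

def shout_v1(text):
    return text.upper()

def shout_v2(text):                     # same input type, changed logic
    return text.upper() + "!"

cache = MemoCache()
cache.call(shout_v1, "hello")           # miss: first time this input is seen
cache.call(shout_v1, "hello")           # hit: same input, same code
cache.call(shout_v2, "hello")           # miss: the code fingerprint changed
```

Because the code fingerprint is part of the key, editing the function never serves a result computed by the old logic.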

🚀 Quick start

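pip install -U cocoindex

import cocoindex as coco
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.text import RecursiveSplitter

# Note: embed (text -> vector) and PG (passed to postgres.mount_table_target)
# are not defined in this snippet and must be supplied before it will run.

@coco.fn(memo=True)  # memoized: unchanged files are skipped on re-run
async def index_file(file, table):
    # Split the file into chunks and declare one target row per chunk.
    for chunk in RecursiveSplitter().split(await file.read_text()):
        table.declare_row(text=chunk.text, embedding=embed(chunk.text))

@coco.fn
async def main(src):
    # Declare the target table and its vector index, then apply index_file
    # to each file found under src.
    table = await postgres.mount_table_target(PG, table_name="docs")
    table.declare_vector_index(column="embedding")
    await coco.mount_each(index_file, localfs.walk_dir(src).items(), table)

coco.App(coco.AppConfig(name="docs"), main, src="./docs").update_blocking()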

Run once to backfill. Re-run anytime; only changed files are re-embedded.
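
The backfill-then-re-run behaviour can be pictured with a small stand-alone sketch. This is an illustration of the general technique (fingerprint each source, reprocess on mismatch, retire rows whose source is gone), not CocoIndex's internals; sync, fingerprint, and the dict-based state/target are hypothetical stand-ins.

```python
import hashlib

def fingerprint(content):
    return hashlib.sha256(content.encode()).hexdigest()

def sync(source, state, target, process):
    """Bring target in line with source, reprocessing only what changed.

    source: path -> current content
    state:  path -> fingerprint recorded on the previous run (mutated in place)
    target: path -> processed output (mutated in place)
    Returns the paths that were actually reprocessed.
    """
    reprocessed = []
    for path, content in source.items():
        fp = fingerprint(content)
        if state.get(path) != fp:        # new or edited file
            target[path] = process(content)
            state[path] = fp
            reprocessed.append(path)
    for path in list(state):
        if path not in source:           # source deleted: retire the stale row
            del state[path]
            target.pop(path, None)
    return reprocessed

state, target = {}, {}
docs = {"a.md": "alpha", "b.md": "beta"}
first = sync(docs, state, target, str.upper)    # backfill: both files processed

docs["b.md"] = "beta v2"                        # edit one file
del docs["a.md"]                                # delete another
docs["c.md"] = "gamma"                          # add a third
second = sync(docs, state, target, str.upper)   # only b.md and c.md are processed
third = sync(docs, state, target, str.upper)    # nothing changed: no work done
```

Here str.upper stands in for the expensive step (chunking and embedding in the quick start).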

🎨 Why it matters for me

  • Solves the “batch pipelines drift stale” problem head-on: directly relevant for Brain (this vault: re-embed only modified notes), Qamera AI (product catalog freshness), and any RAG-backed agent
  • Same problem space as LightRAG, but focused on incrementality rather than retrieval quality; complementary, not competitive
  • Built-in MCP server (CocoIndex-code) gives Claude Code / Cursor a whole-repo semantic index without bespoke plumbing; drop-in for Agentic Coding workflows
  • The “React for data engineering” mental model maps cleanly to declarative pipelines I already build in Make / Airtable

📒 Reference example tree (20+ recipes)

| Example | What it does |
| --- | --- |
| code_embedding | Walks git repo, AST-aware chunking, sentence-transformers → pgvector / LanceDB; incremental per commit |
| pdf_embedding | PDFs from local / S3 / Google Drive → recursive chunk → vector index; only edited PDFs re-embed |
| hn_trending_topics | Algolia HN API → recursive comments → Gemini 2.5 Flash extracts typed topics → weighted ranking |
| conversation_to_knowledge | Meeting transcripts / Slack / podcasts → LLM extracts people, topics, decisions, actions → Neo4j / Kuzu |
| multi_codebase_summarization | Walk N repos → README/API/module extraction → LLM summary → top-level rollup; only commits trigger re-run |
| patient_intake_extraction_baml | Messy forms/PDFs/invoices → BAML or DSPy typed extraction → Postgres |
| podcast → knowledge graph | YouTube audio → Whisper/AssemblyAI diarization → per-speaker statement extraction → SurrealDB / Neo4j |
| csv_to_kafka | Watch CSV folder → publish each row as JSON to Kafka topic (StreamNative/Confluent), keyed by PK; sub-second |
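
As a rough picture of the "each row as JSON, keyed by PK" step in the csv_to_kafka recipe, the sketch below turns CSV text into (key, value) byte pairs of the kind a Kafka producer would send. It is not the recipe's actual code: rows_to_messages is a hypothetical helper, the column names are made up, and no Kafka client is involved.

```python
import csv
import io
import json

def rows_to_messages(csv_text, key_column):
    """Yield one (key, value) pair per CSV row; value is the row as JSON."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        key = row[key_column].encode()                   # primary key as message key
        value = json.dumps(row, sort_keys=True).encode()
        yield key, value

csv_text = "id,name\n1,Ada\n2,Lin\n"                     # made-up sample data
messages = list(rows_to_messages(csv_text, "id"))
```

Keying by primary key means a later edit to a row produces a message with the same key, so downstream consumers can treat it as an update to that row.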

☘️ Architecture notes

  • Rust core: production-grade from day one; parallel chunking, zero-copy where possible, per-record failure isolation
  • Connectors: localfs, postgres, plus pluggable sources/targets (S3, GDrive, Kafka, Neo4j, Kuzu, SurrealDB, pgvector, LanceDB)
  • Enterprise tier: petabyte-scale stores; “process once, reconcile forever”
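
Per-record failure isolation, mentioned in the Rust core bullet, is easy to picture with a small sketch. This uses Python threads purely as an illustration of the pattern (one bad record is reported and the rest still complete); it says nothing about how the Rust engine does it, and process_all and parse_int are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor

def process_all(records, fn, max_workers=4):
    """Run fn over records in parallel; a failing record is reported, not fatal."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {key: pool.submit(fn, value) for key, value in records.items()}
        for key, future in futures.items():
            try:
                results[key] = future.result()
            except Exception as exc:     # isolate the failure to this one record
                failures[key] = exc
    return results, failures

def parse_int(text):
    return int(text)

results, failures = process_all({"a": "1", "b": "oops", "c": "3"}, parse_int)
```

Records "a" and "c" land in results while "b" is captured in failures with its exception, so it can be retried or inspected without rerunning the batch.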

📖 Further reading