CocoIndex
Open-source incremental data-transformation engine for AI agents. Turns codebases, meeting notes, inboxes, Slack, PDFs, and videos into live, continuously fresh context for LLM apps, recomputing only the delta when sources change. Python declarative API on top of a Rust engine; Apache 2.0.
Tagline: “Your agents deserve fresh context.”
Core idea
- Incremental: only the Δ. When a source changes, CocoIndex identifies affected records, propagates the change across joins/lookups, updates the target, and retires stale rows without touching anything else
- Declarative: describe what should be in the target; the engine keeps it in sync forever
- Cached by content hash: `@coco.fn(memo=True)` keys the cache on `hash(input) + hash(code)`, so logic changes invalidate correctly
- Parallel by default: Rust core, zero-copy transforms where possible, failure isolation per record
- Explainable lineage: sources → flows → targets, with traceable provenance
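
The content-hash caching idea can be sketched in plain Python. This is a minimal illustration of keying a cache on a hash of the input plus a hash of the function's code, not CocoIndex's actual implementation. The names `memoize`, `code_fingerprint`, and the in-memory `cache` dict are hypothetical, and hashing compiled bytecode is an assumption made to keep the example self-contained.

```python
import hashlib


def code_fingerprint(fn):
    """Hash the function's bytecode and constants (a stand-in for hash(code))."""
    code = fn.__code__
    payload = code.co_code + repr(code.co_consts).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def memoize(cache):
    """Decorator factory: cache results under (hash(input), hash(code))."""
    def decorator(fn):
        fingerprint = code_fingerprint(fn)

        def wrapper(text):
            input_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
            key = (input_hash, fingerprint)
            if key not in cache:
                cache[key] = fn(text)  # cache miss: run the real logic
            return cache[key]

        return wrapper
    return decorator


cache = {}   # hypothetical in-memory store; a real engine would persist this
calls = []   # records every time the underlying logic actually runs


@memoize(cache)
def shout(text):
    calls.append(text)
    return text.upper()


first = shout("hello")    # miss: logic runs
second = shout("hello")   # hit: same input, same code


@memoize(cache)
def shout(text):          # same name, changed logic
    calls.append(text)
    return text.upper() + "!"


third = shout("hello")    # miss: same input, but the code hash changed
```

Because the key includes the code fingerprint, editing the function body invalidates its old entries without any manual cache clearing, while unchanged functions keep their hits.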
Links
- GitHub: https://github.com/cocoindex-io/cocoindex (Apache 2.0)
- Homepage: https://cocoindex.io/
- Docs: https://cocoindex.io/docs
- Quickstart: https://cocoindex.io/docs/getting_started/quickstart
- Discord: https://discord.com/invite/zpA9S2DR7s
- X: https://x.com/cocoindex_io
- CocoIndex skill for AI coding agents (Claude Code, Cursor): https://github.com/cocoindex-io/cocoindex/blob/main/skills/cocoindex
- Flagship: CocoIndex-code, an MCP server giving Claude Code / Cursor an AST-aware, incremental, semantic code index of the whole repo
Quick start
```sh
pip install -U cocoindex
```

```python
import cocoindex as coco
from cocoindex.connectors import localfs, postgres
from cocoindex.ops.text import RecursiveSplitter

# `embed` and `PG` are not defined in this snippet; they are assumed to be
# provided elsewhere (an embedding function and a Postgres connection/context).


@coco.fn(memo=True)
async def index_file(file, table):
    # Split each file into chunks and declare one target row per chunk.
    for chunk in RecursiveSplitter().split(await file.read_text()):
        table.declare_row(text=chunk.text, embedding=embed(chunk.text))


@coco.fn
async def main(src):
    table = await postgres.mount_table_target(PG, table_name="docs")
    table.declare_vector_index(column="embedding")
    await coco.mount_each(index_file, localfs.walk_dir(src).items(), table)


coco.App(coco.AppConfig(name="docs"), main, src="./docs").update_blocking()
```

Run once to backfill. Re-run anytime; only changed files re-embed.
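
To make the "only changed files re-embed" behavior concrete, here is a rough sketch of how a re-run could decide what to do by comparing content fingerprints against the previous run. It illustrates the general technique under stated assumptions and is not how CocoIndex tracks state internally. `plan_update`, `fingerprint`, and the dict-based state are hypothetical, and files are modeled as an in-memory mapping of path to text.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Content hash used to detect whether a file changed between runs."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def plan_update(previous: dict, current_files: dict):
    """Compare the previous fingerprints with the current files.

    Returns (to_process, to_delete, new_state):
      to_process: paths that are new or whose content changed
      to_delete:  paths seen last run that no longer exist (stale rows)
      new_state:  fingerprints to store for the next run
    """
    new_state = {path: fingerprint(text) for path, text in current_files.items()}
    to_process = sorted(p for p, h in new_state.items() if previous.get(p) != h)
    to_delete = sorted(p for p in previous if p not in new_state)
    return to_process, to_delete, new_state


# First run: nothing is known yet, so every file is processed (the backfill).
state = {}
files = {"a.md": "alpha", "b.md": "beta"}
first_run, first_deleted, state = plan_update(state, files)

# Second run: b.md edited, a.md removed, c.md added.
files = {"b.md": "beta v2", "c.md": "gamma"}
second_run, second_deleted, state = plan_update(state, files)
```

In this sketch the second run touches only `b.md` and `c.md` and retires `a.md`; anything left unchanged is never reprocessed.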
Why it matters for me
- Solves the “batch pipelines drift stale” problem head-on; directly relevant for Brain (this vault: re-embed only modified notes), Qamera AI (product catalog freshness), and any RAG-backed agent
- Same problem space as LightRAG but with a sharper take on incrementality rather than retrieval quality; complementary, not competitive
- Built-in MCP server (`CocoIndex-code`) gives Claude Code / Cursor a whole-repo semantic index without bespoke plumbing; drop-in for Agentic Coding workflows
- The “React for data engineering” mental model maps cleanly to declarative pipelines I already build in Make / Airtable
Reference example tree (20+ recipes)
| Example | What it does |
|---|---|
| `code_embedding` | Walks git repo, AST-aware chunking, sentence-transformers → pgvector / LanceDB; incremental per commit |
| `pdf_embedding` | PDFs from local / S3 / Google Drive → recursive chunk → vector index; only edited PDFs re-embed |
| `hn_trending_topics` | Algolia HN API → recursive comments → Gemini 2.5 Flash extracts typed topics → weighted ranking |
| `conversation_to_knowledge` | Meeting transcripts / Slack / podcasts → LLM extracts people, topics, decisions, actions → Neo4j / Kuzu |
| `multi_codebase_summarization` | Walk N repos → README/API/module extraction → LLM summary → top-level rollup; only commits trigger re-run |
| `patient_intake_extraction_baml` | Messy forms/PDFs/invoices → BAML or DSPy typed extraction → Postgres |
| podcast → knowledge graph | YouTube audio → Whisper/AssemblyAI diarization → per-speaker statement extraction → SurrealDB / Neo4j |
| `csv_to_kafka` | Watch CSV folder → publish each row as JSON to Kafka topic (StreamNative/Confluent), keyed by PK; sub-second |
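
As a small illustration of the `csv_to_kafka` row shape, the sketch below turns CSV rows into key/value byte pairs, with the primary key as the message key and the row serialized as JSON. It deliberately stops short of a Kafka client. The function name `rows_to_messages`, the `pk_field` parameter, and the sample data are hypothetical, and the actual recipe may serialize rows differently.

```python
import csv
import io
import json


def rows_to_messages(csv_text: str, pk_field: str):
    """Yield (key, value) byte pairs, one per CSV row, keyed by the primary key."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        key = row[pk_field].encode("utf-8")
        # sort_keys keeps the payload stable for identical rows
        value = json.dumps(row, sort_keys=True).encode("utf-8")
        yield key, value


sample = "id,name\n1,Ada\n2,Grace\n"
messages = list(rows_to_messages(sample, pk_field="id"))
```

Keying by primary key means repeated updates to the same row land on the same partition in a Kafka topic, so consumers see them in order.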
Architecture notes
- Rust core: production-grade from day one; parallel chunking, zero-copy where possible, per-record failure isolation
- Connectors: `localfs`, `postgres`, plus pluggable sources/targets (S3, GDrive, Kafka, Neo4j, Kuzu, SurrealDB, pgvector, LanceDB)
- Enterprise tier: petabyte-scale stores; “process once, reconcile forever”
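
Per-record failure isolation can be pictured with a short Python sketch: each record is processed in parallel, and an exception in one record is captured instead of aborting the batch. This uses the standard library's thread pool purely for illustration and says nothing about how the Rust core schedules work. `process_all` and `word_count` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def process_all(records: dict, transform, max_workers: int = 4):
    """Run `transform` on every record in parallel, isolating failures per record.

    Returns (results, failures): successful outputs by key, and the
    exception raised for each key that failed.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {key: pool.submit(transform, value) for key, value in records.items()}
        for key, future in futures.items():
            try:
                results[key] = future.result()
            except Exception as exc:  # one bad record must not stop the others
                failures[key] = exc
    return results, failures


def word_count(text: str) -> int:
    if not text:
        raise ValueError("empty document")
    return len(text.split())


results, failures = process_all({"a": "one two", "b": "", "c": "three"}, word_count)
```

Here record `b` fails while `a` and `c` still produce results, which is the property that lets a pipeline retry only the failed records later.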
Further reading
- LightRAG: RAG framework with KG entity extraction; pairs well as the retrieval layer downstream of CocoIndex
- Graphify: knowledge-graph generator skill
- Brain: this vault; candidate ingestion target
- LLM Knowledge Bases
- AI Chatbots Architecture
- Qamera AI