CocoIndex
Open-source incremental data framework that keeps RAG indexes and agent context continuously fresh.
Pick CocoIndex if you're building a code- or document-aware agent that needs a continuously fresh index without re-embedding the world on every run.
Skip it if you want a managed RAG SaaS, a no-code dashboard, or a non-Python stack.
CocoIndex is a Python-native, open-source data framework built to feed AI agents and RAG pipelines with continuously fresh context. Instead of re-embedding entire corpora on every run, it tracks deltas in codebases, documents, and other sources and reprocesses only what changed, with end-to-end lineage and automatic schema evolution. Out of the box it does AST-based code indexing (via tree-sitter), call-graph and symbol-table extraction, semantic search, and parallel task scheduling.
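To make "reprocesses only what changed" concrete, here is a minimal sketch of hash-based delta tracking in plain Python. It is not CocoIndex's API or its internal algorithm. The names `plan_incremental_update`, `apply_update`, `seen`, and `embed` are hypothetical, and the in-memory dicts stand in for whatever state store and vector index you actually run.

```python
import hashlib
from typing import Callable


def content_hash(text: str) -> str:
    """Stable fingerprint of a source's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_incremental_update(
    sources: dict[str, str], seen: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Compare current sources against the hashes recorded on the last run.

    Returns (keys to (re)process, keys to delete), both sorted.
    """
    to_process = [
        key for key, text in sources.items() if seen.get(key) != content_hash(text)
    ]
    to_delete = [key for key in seen if key not in sources]
    return sorted(to_process), sorted(to_delete)


def apply_update(
    sources: dict[str, str],
    seen: dict[str, str],
    index: dict[str, list[float]],
    embed: Callable[[str], list[float]],
) -> tuple[list[str], list[str]]:
    """Embed only new or changed sources and drop entries for removed ones."""
    to_process, to_delete = plan_incremental_update(sources, seen)
    for key in to_delete:
        index.pop(key, None)
        seen.pop(key, None)
    for key in to_process:
        index[key] = embed(sources[key])  # the expensive step, run only on deltas
        seen[key] = content_hash(sources[key])
    return to_process, to_delete
```

Calling `apply_update` twice with the same `sources` triggers no embedding calls on the second run. Editing one file re-embeds only that file, and removing a file drops its index entry.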
It's aimed at engineers building long-horizon agents - code-review bots, refactoring assistants, security scanners, knowledge-graph extractors over meeting notes, multi-repo summarizers - where stale indexes are the whole problem. Pricing isn't published because the framework itself is free and self-hosted; you bring your own Postgres/pgvector, embedding model, and LLM. There's a Claude skill integration and starter projects that claim a 10-minute path to production.
Think of it as the dbt-for-RAG layer: declarative transformations, incremental computation, and lineage, with first-class support for source-code semantics that generic vector-DB ETL tools ignore.
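As a rough illustration of what "source-code semantics" buys you over text chunking, the sketch below pulls a symbol table and a simple call graph out of Python source. It swaps in Python's standard-library `ast` module for tree-sitter, so it only handles Python and is not how CocoIndex itself does it. The function name `index_module` is hypothetical.

```python
import ast


def index_module(source: str) -> tuple[dict[str, int], dict[str, list[str]]]:
    """Return (symbol table, call graph) for top-level functions in `source`.

    The symbol table maps function name -> line number of its definition.
    The call graph maps function name -> sorted names it calls directly.
    Only plain-name calls like `foo()` are tracked; attribute calls such as
    `obj.method()` are ignored to keep the sketch short.
    """
    tree = ast.parse(source)
    symbols: dict[str, int] = {}
    calls: dict[str, list[str]] = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols[node.name] = node.lineno
            callees = set()
            for child in ast.walk(node):
                if isinstance(child, ast.Call) and isinstance(child.func, ast.Name):
                    callees.add(child.func.id)
            calls[node.name] = sorted(callees)
    return symbols, calls
```

Given a module where `main` calls `helper`, the call graph records that edge, so an agent can answer "what breaks if I change `helper`?" from the index. A chunk-and-embed pipeline only sees the two functions as unrelated text spans.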
CocoIndex sits in the unglamorous but critical 'keep the index honest' layer that most agent demos quietly skip. The AST-based code indexing and incremental lineage are genuinely differentiated versus generic chunk-and-embed pipelines. Expect to do real infra work - this is a framework, not a product you log into.
— The AI Tool Bible editorial team
Pros
- ✅ Incremental reprocessing keeps indexes fresh to within a second of a change, without full reruns
- ✅ AST-aware code indexing with call graphs, not just naive text chunking
- ✅ Open source and self-hosted; works with Postgres/pgvector
- ✅ Declarative Python API with lineage and schema evolution built in
Cons
- ⚠️ Self-hosted only - you operate the database, embeddings, and LLM yourself
- ⚠️ Python-only framework; no managed cloud or hosted UI
- ⚠️ Younger ecosystem than LlamaIndex or LangChain
Compare with similar tools
Pinecone
Managed vector database for production-scale similarity search.
LlamaIndex
Data framework for connecting LLMs to your data.
Weaviate
Open-source vector DB with hybrid search and modules.
LangChain
The broad LLM application framework — chains, agents, retrievers.
Vespa
Open-source search engine, originally developed at Yahoo, with vector + sparse retrieval.
Chroma
Embedded, developer-friendly vector store for Python.