Collections (Multi-Modal Search)
The CollectionManager is the powerhouse of BeaverDB. It provides a unified interface for storing Documents and performing Vector Search, Full-Text Search (FTS), Fuzzy Matching, and Graph Traversal on them.
This is ideal for building RAG (Retrieval Augmented Generation) applications, search engines, knowledge graphs, or recommendation systems.
Quick Start
A Collection stores Document objects. Each document has a unique id, an optional vector embedding, and a JSON-serializable body.
from beaver import BeaverDB
from beaver.collections import Document
db = BeaverDB("app.db")
docs = db.collection("articles")
# 1. Index a Document (with embedding for semantic search)
doc = Document(
    id="doc_1",
    body={"title": "Introduction to AI", "content": "AI is changing the world..."},
    embedding=[0.1, 0.2, 0.8, ...]  # 768-dim vector
)
docs.index(doc)
# 2. Semantic Search (Vector)
# Finds documents semantically similar to the query vector
results = docs.search(query_vector, top_k=5)
# 3. Keyword Search (FTS)
# Finds documents containing specific words
matches = docs.match("artificial intelligence")
Managing Documents
The Document Object
BeaverDB uses a strict Pydantic model for items.
- id: String (UUID by default).
- embedding: List of floats (or None).
- body: Dictionary or Pydantic model (Metadata).
from pydantic import BaseModel
class Article(BaseModel):
    title: str
    tags: list[str]
# Typed Collection
articles = db.collection("news", model=Article)
# Indexing
articles.index(Document(body=Article(title="New Release", tags=["tech"])))
Indexing
The .index() method performs an atomic upsert. It updates the main storage, vector index, FTS index, and fuzzy n-gram index simultaneously.
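One way to picture an atomic upsert across several indexes is a single SQLite transaction that updates every table or none of them. The sketch below only illustrates that idea with the standard `sqlite3` module. The `documents` and `terms` tables and the `upsert` helper are hypothetical names chosen for the example, not BeaverDB's actual schema or API.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: main storage plus one auxiliary term index.
conn.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, body TEXT)")
conn.execute("CREATE TABLE terms (doc_id TEXT, term TEXT)")


def upsert(conn, doc_id, body, fields):
    """Write the document and its term index inside one transaction."""
    with conn:  # commits on success, rolls back if anything below raises
        conn.execute(
            "INSERT OR REPLACE INTO documents (id, body) VALUES (?, ?)",
            (doc_id, json.dumps(body)),
        )
        # Drop stale index rows before writing the fresh ones.
        conn.execute("DELETE FROM terms WHERE doc_id = ?", (doc_id,))
        rows = [
            (doc_id, word.lower())
            for field in fields
            for word in str(body[field]).split()
        ]
        conn.executemany("INSERT INTO terms (doc_id, term) VALUES (?, ?)", rows)


upsert(conn, "doc_1", {"title": "Intro to AI"}, ["title"])
upsert(conn, "doc_1", {"title": "Advanced AI"}, ["title"])  # replaces, no duplicate
```

If any step fails (for example, a field name that is missing from the body), the transaction rolls back, so the storage table and the index never disagree.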
# Index with specific FTS fields
docs.index(doc, fts=["title", "content"])
# Enable Fuzzy Search support (slower indexing, robust matching)
docs.index(doc, fts=["title"], fuzzy=True)
Deleting & Clearing
# Remove a single document
docs.drop(doc)
# Wipe the entire collection (Vectors, Graph, FTS)
docs.clear()
Search Capabilities
1. Vector Search (Semantic)
Performs an Approximate Nearest Neighbor (ANN) search using Cosine Similarity.
- Speed: Extremely fast (in-memory index).
- Persistence: The index is fully persisted to disk and crash-safe.
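To make the scoring concrete, here is a minimal sketch of cosine similarity with an exact, brute-force top-k scan. It is not an ANN index and not BeaverDB's implementation. The `cosine_similarity` and `exact_search` helpers are hypothetical and only show what the similarity score measures.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def exact_search(vectors, query, top_k=10):
    """Score every stored vector and keep the top_k highest scores."""
    scored = [(doc_id, cosine_similarity(vec, query)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
hits = exact_search(vectors, [1.0, 0.0], top_k=2)  # "a" ranks first, then "c"
```

An approximate index trades a little recall for speed by avoiding the full scan, but it ranks candidates by the same score.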
# Returns list of (Document, score) tuples
results = docs.search(embedding_vector, top_k=10)
for doc, score in results:
    print(f"{score:.4f}: {doc.body['title']}")
2. Full-Text Search (FTS)
Uses SQLite’s FTS5 engine for powerful keyword matching. Supports boolean operators (AND, OR, NOT) and prefix matching.
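For reference, the sketch below runs a few raw FTS5 queries through Python's `sqlite3` module, assuming the bundled SQLite was built with FTS5 enabled. The `articles` table and the `fts` helper are hypothetical. How BeaverDB translates its own query strings into FTS5 syntax is not shown here. Note that in raw FTS5, `NOT` is a binary operator (`a NOT b`).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical FTS5 table with two indexed columns.
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, content)")
conn.executemany(
    "INSERT INTO articles (title, content) VALUES (?, ?)",
    [
        ("Python tutorial", "Build a database client in python"),
        ("Java tutorial", "Port a python script to java"),
        ("Release notes", "Database engine updates"),
    ],
)


def fts(query):
    """Return matching titles in insertion order."""
    rows = conn.execute(
        "SELECT title FROM articles WHERE articles MATCH ? ORDER BY rowid", (query,)
    )
    return [title for (title,) in rows]


both = fts("python AND database")   # both terms must appear in the row
without_java = fts("python NOT java")  # rows with "python" but not "java"
prefix = fts("data*")               # prefix match on "data"
title_only = fts("title : tutorial")  # restrict the match to one column
```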
# Simple match
docs.match("python database")
# Specific fields
docs.match("tutorial", on=["title"])
# Boolean query
docs.match("python AND NOT java")
3. Fuzzy Search (Typo Tolerance)
If you indexed with fuzzy=True, you can find documents even with typos. BeaverDB uses a hybrid Trigram + Levenshtein approach for high performance.
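A trigram plus Levenshtein hybrid typically works in two stages: a cheap trigram overlap test narrows the candidates, then an exact edit-distance check confirms them. The following is a minimal sketch of that general technique, not BeaverDB's code. The `trigrams`, `levenshtein`, and `fuzzy_match` helpers and the `min_shared` threshold are assumptions made for illustration.

```python
def trigrams(text):
    """Set of lowercase 3-character substrings."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def levenshtein(a, b):
    """Edit distance via the classic row-by-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def fuzzy_match(query, terms, fuzziness=1, min_shared=1):
    """Stage 1: trigram overlap filter. Stage 2: edit-distance verification."""
    query_grams = trigrams(query)
    candidates = [t for t in terms if len(query_grams & trigrams(t)) >= min_shared]
    return [t for t in candidates if levenshtein(query.lower(), t.lower()) <= fuzziness]


terms = ["BeaverDB", "Database", "Beverage"]
matches = fuzzy_match("BaverDB", terms, fuzziness=1)
```

Here "Database" shares no trigram with the query and is skipped cheaply, "Beverage" passes the filter but fails the distance check, and only "BeaverDB" (one insertion away) is returned.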
# Finds "BeaverDB" even if user types "BaverDB"
docs.match("BaverDB", fuzziness=1)
4. Hybrid Search (Reranking)
To get the best of both worlds (semantic understanding + keyword precision), use Reciprocal Rank Fusion (RRF). BeaverDB provides a built-in rerank helper.
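In its standard form, RRF gives each document a score of 1 / (k + rank) per result list and sums those scores across lists, so documents ranked well by several searches rise to the top. The sketch below implements that textbook formula on plain ID lists. The `reciprocal_rank_fusion` function is a hypothetical stand-in and may differ from how the built-in helper works internally.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_ids = ["d1", "d2", "d3"]   # best-first results from a vector search
keyword_ids = ["d3", "d1", "d4"]  # best-first results from a keyword search
fused = reciprocal_rank_fusion([vector_ids, keyword_ids], k=60)
```

Because only rank positions matter, the two searches do not need comparable raw scores. A larger `k` flattens the difference between top and lower ranks.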
from beaver.collections import rerank
# 1. Run both searches
vector_hits = [doc for doc, _ in docs.search(vec, top_k=50)]
keyword_hits = [doc for doc, _ in docs.match("query", top_k=50)]
# 2. Fuse results
# Scores are normalized based on rank position
final_results = rerank(vector_hits, keyword_hits, k=60)
Graph & Relationships
BeaverDB is also a Graph Database. You can link documents together and traverse the connections to find related context (e.g., for GraphRAG).
Linking Documents
Create directed, labeled edges between documents.
# d1 -> d2 (d1 references d2)
docs.connect(d1, d2, label="references")
# d2 -> d3 (d2 is the parent of d3)
docs.connect(d2, d3, label="parent")
Neighbors & Traversal
Retrieve connected documents efficiently.
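A bounded multi-hop traversal over labeled edges can be expressed in SQLite as a recursive CTE. The sketch below shows the general pattern with the standard `sqlite3` module. The `edges` table layout and the `walk` function here are assumptions for illustration and are not taken from BeaverDB's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical edge table: one row per directed, labeled link.
conn.execute("CREATE TABLE edges (source TEXT, target TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO edges VALUES (?, ?, ?)",
    [
        ("d1", "d2", "references"),
        ("d2", "d3", "parent"),
        ("d3", "d4", "parent"),
    ],
)


def walk(conn, start, labels, depth):
    """Return (node, hops) for nodes reachable within `depth` hops of `start`."""
    placeholders = ", ".join("?" for _ in labels)
    sql = f"""
        WITH RECURSIVE reach(node, hops) AS (
            SELECT ?, 0
            UNION
            SELECT e.target, r.hops + 1
            FROM edges AS e
            JOIN reach AS r ON e.source = r.node
            WHERE r.hops < ? AND e.label IN ({placeholders})
        )
        SELECT node, MIN(hops) AS hops
        FROM reach
        WHERE hops > 0
        GROUP BY node
        ORDER BY hops, node
    """
    return conn.execute(sql, (start, depth, *labels)).fetchall()


two_hops = walk(conn, "d1", ["references", "parent"], depth=2)
```

The hop limit in the recursive step keeps the query finite even when the graph contains cycles, and `MIN(hops)` reports the shortest distance to each node.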
# Get immediate neighbors
refs = docs.neighbors(d1, label="references")
# Multi-hop Traversal (BFS)
# "Find everything d1 connects to, up to 2 hops away"
# This runs a Recursive CTE in SQLite (Very Fast)
context = docs.walk(d1, labels=["references", "parent"], depth=2)
Maintenance
Compaction
The vector index uses a Hybrid Architecture (Base Snapshot + Delta Log). Over time, the delta log grows.
- Auto-Compaction: Triggered automatically when the log gets too large.
- Manual Compaction: You can force a merge of the log into the main index.
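The base snapshot plus delta log idea can be sketched with a toy in-memory store: writes are appended to a log, reads replay the log over the snapshot, and compaction folds the log into a new snapshot. The `DeltaLogIndex` class and its `max_log` threshold are hypothetical and only illustrate the pattern. BeaverDB's actual trigger and on-disk format are not described here.

```python
class DeltaLogIndex:
    """Toy base-snapshot-plus-delta-log store (illustrative only)."""

    def __init__(self, max_log=1000):
        self.base = {}          # compacted snapshot: id -> vector
        self.log = []           # append-only delta log of (op, id, vector)
        self.max_log = max_log  # assumed auto-compaction threshold

    def upsert(self, doc_id, vector):
        self._append(("put", doc_id, vector))

    def delete(self, doc_id):
        self._append(("del", doc_id, None))

    def _append(self, entry):
        self.log.append(entry)
        if len(self.log) > self.max_log:  # log too large: compact automatically
            self.compact()

    def snapshot(self):
        """Current view: the base with the log replayed on top."""
        view = dict(self.base)
        for op, doc_id, vector in self.log:
            if op == "put":
                view[doc_id] = vector
            else:
                view.pop(doc_id, None)
        return view

    def compact(self):
        """Merge the delta log into the base snapshot and clear the log."""
        self.base = self.snapshot()
        self.log = []


index = DeltaLogIndex(max_log=3)
index.upsert("a", [0.1])
index.upsert("b", [0.2])
index.delete("a")
pending = len(index.log)  # three entries, threshold not yet exceeded
index.upsert("c", [0.3])  # fourth entry exceeds max_log and triggers compaction
```

Appending to a log keeps writes cheap, while compaction bounds the amount of replay a reader has to do.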
docs.compact()
Exporting
Dump the entire collection (vectors and metadata) to JSON for backup.
data = docs.dump()
# OR write directly to file
with open("backup.json", "w") as f:
    docs.dump(f)