docsense.indexer

Document indexing and processing module.

class Document(content, metadata=None)

Represents a document or a chunk of a document.

A Document object holds the text content and its associated metadata. Metadata can include the source, timestamps, chunk positions, and similar details.

__init__(content, metadata=None)

Initialize a Document object.

Parameters:
  • content (str) – The text content of the document

  • metadata (Optional[dict]) – Optional dictionary containing metadata about the document. If None, an empty dict will be used.

content: str
metadata: dict
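
The default handling described above can be illustrated with a small stand-in class. This is a sketch only, not the library's source: the name `DocumentSketch` and the sample metadata keys are hypothetical, and the only behaviour taken from the reference is that a `None` metadata argument becomes an empty dict.

```python
from typing import Optional


class DocumentSketch:
    """Hypothetical stand-in mirroring the documented Document signature."""

    def __init__(self, content: str, metadata: Optional[dict] = None):
        self.content = content
        # A None default avoids sharing one mutable dict across instances.
        self.metadata = metadata if metadata is not None else {}


# Example metadata keys ("source", "chunk_index") are illustrative assumptions.
chunk = DocumentSketch(
    "FAISS is a similarity search library.",
    metadata={"source": "notes.txt", "chunk_index": 0},
)
bare = DocumentSketch("No metadata supplied.")
```

Here `bare.metadata` is a fresh empty dict, so mutating it does not affect any other instance.
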
class DocumentLoader

Load documents from various sources.

__init__()
load_directory(path)

Load all supported documents from a directory recursively.

Parameters:

path (str | Path) – Directory path to load documents from

Return type:

list[Document]

Returns:

List of Document objects containing file contents and metadata

Raises:
class VectorStore(dimension, index_path=None, use_gpu=True)

Vector store for document embeddings using FAISS.

This class implements a vector store that uses FAISS for efficient similarity search of document embeddings. It supports:

  • GPU acceleration when available

  • Persistence to disk

  • Document metadata management

  • IVF (Inverted File) index for faster search

__init__(dimension, index_path=None, use_gpu=True)

Initialize the vector store.

Parameters:
  • dimension (int) – Dimension of the embedding vectors

  • index_path (Optional[str]) – Path to load/save the index and metadata. If None, the store is kept in memory only

  • use_gpu (bool) – Whether to use GPU for FAISS operations. Falls back to CPU if GPU is not available

Raises:
add_documents(documents, embeddings)

Add documents and their embeddings to the store.

Parameters:
  • documents (List[Document]) – List of Document objects to add

  • embeddings (ndarray) – numpy array of document embeddings with shape (n_docs, dimension)

Raises:

ValueError – If number of documents doesn’t match number of embeddings, or if embedding dimensions don’t match

Return type:

None
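
The two ValueError conditions listed above can be sketched as a standalone check. This is an illustration under stated assumptions, not the library's implementation: `validate_batch` is a hypothetical helper, and plain nested lists stand in for the numpy array of shape (n_docs, dimension).

```python
from typing import Sequence


def validate_batch(
    documents: Sequence[object],
    embeddings: Sequence[Sequence[float]],
    dimension: int,
) -> None:
    """Hypothetical helper: raise ValueError on the two documented mismatches."""
    # One embedding row is expected per document.
    if len(documents) != len(embeddings):
        raise ValueError(
            f"got {len(documents)} documents but {len(embeddings)} embeddings"
        )
    # Every row must match the dimension the store was created with.
    for row, vector in enumerate(embeddings):
        if len(vector) != dimension:
            raise ValueError(
                f"embedding {row} has dimension {len(vector)}, expected {dimension}"
            )


# A consistent batch passes silently.
validate_batch(["doc a", "doc b"], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], dimension=3)
```

Running such a check before touching the index means a mismatched batch is rejected as a whole rather than partially added.
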

search(query_embedding, k=2)

Search for most similar documents using the query embedding.

Parameters:
  • query_embedding (ndarray) – Query vector with shape (dimension,) or (1, dimension)

  • k (int) – Number of results to return

Return type:

List[Tuple[Document, float]]

Returns:

List of (document, distance) tuples sorted by similarity (closest first)

Raises:

ValueError – If query_embedding has invalid shape
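
To make the search contract concrete, the sketch below reproduces it with a brute-force scan in pure Python. It is a stand-in for illustration, not how the FAISS-backed store works internally: `brute_force_search` is a hypothetical function, nested lists stand in for numpy arrays, and Euclidean distance is an assumed metric.

```python
import math
from typing import List, Sequence, Tuple


def brute_force_search(
    query,
    vectors: Sequence[Sequence[float]],
    documents: Sequence[str],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Hypothetical stand-in returning (document, distance), closest first."""
    # Accept shape (dimension,) or (1, dimension), expressed as nested lists.
    if query and isinstance(query[0], (list, tuple)):
        if len(query) != 1:
            raise ValueError("query must have shape (dimension,) or (1, dimension)")
        query = query[0]
    if vectors and len(query) != len(vectors[0]):
        raise ValueError("query dimension does not match stored vectors")
    # Score every stored vector, then keep the k smallest distances.
    scored = [(doc, math.dist(query, vec)) for doc, vec in zip(documents, vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
documents = ["origin", "unit-x", "far"]
results = brute_force_search([[0.9, 0.1]], vectors, documents, k=2)
```

With these sample vectors, `results` lists "unit-x" first and "origin" second, since the query lies closest to (1.0, 0.0).
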

save()

Save the index and metadata to disk.

This method saves both the FAISS index and document metadata to the specified index path. The index is saved in FAISS binary format and metadata in JSON.

Raises:
Return type:

None

load()

Load the index and metadata from disk.

This method loads both the FAISS index and document metadata from the specified index path. The index is loaded from FAISS binary format and metadata from JSON.

Raises:
Return type:

None
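
The JSON half of the save() and load() round trip can be sketched with the standard library alone. Everything here is an assumption made for illustration: the helper names, the `metadata.json` file name, and the record layout are hypothetical, and the reference does not state how the library arranges files under the index path. For the binary half, FAISS provides `faiss.write_index` and `faiss.read_index`; whether this module calls them is not confirmed by the reference.

```python
import json
import tempfile
from pathlib import Path


def save_metadata(records: list, path: Path) -> None:
    """Hypothetical helper: write document records as JSON."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")


def load_metadata(path: Path) -> list:
    """Hypothetical helper: read document records back from JSON."""
    return json.loads(path.read_text(encoding="utf-8"))


# Assumed record layout: one dict per document with content and metadata.
records = [
    {"content": "first chunk", "metadata": {"source": "a.txt", "chunk_index": 0}},
    {"content": "second chunk", "metadata": {"source": "a.txt", "chunk_index": 1}},
]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "store" / "metadata.json"
    save_metadata(records, target)
    restored = load_metadata(target)
```

Because JSON only represents basic types, metadata values such as datetime objects would need converting to strings before saving in a scheme like this one.
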

clear()

Clear all documents and reset the index.

This method removes all documents and their embeddings from the store, effectively resetting it to its initial state. If persistence is enabled, the cleared state will be saved to disk.

Return type:

None

Modules

document

Document processing module for loading and chunking documents.

document_loader

Document loader implementation.

vector_store

Vector store implementation using FAISS.