docsense.indexer

Document indexing and processing module.

class Document(content, metadata=None)[source]

Represents a document or a chunk of a document.

A Document object holds the text content and its associated metadata. Metadata can include information such as the source, timestamps, and chunk positions.

Parameters:

content (str)
metadata (Optional[dict])

__init__(content, metadata=None)[source]

Initialize a Document object.

Parameters:

content (str) – The text content of the document
metadata (Optional[dict]) – Optional dictionary containing metadata about the document. If None, an empty dict will be used.

content: str

metadata: dict
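
The default-metadata behavior described above can be illustrated with a small stand-in class. This is a minimal sketch, not the docsense implementation: the class below only mirrors the documented signature, and the metadata keys (`source`, `chunk`) are illustrative assumptions.

```python
from typing import Optional


class Document:
    """Stand-in mirroring the documented signature; not the docsense source."""

    def __init__(self, content: str, metadata: Optional[dict] = None):
        self.content = content
        # Create a fresh dict per instance so that instances never share
        # one mutable default object.
        self.metadata = metadata if metadata is not None else {}


doc_a = Document("First chunk of text.")
# The keys below are hypothetical examples of metadata.
doc_b = Document("Second chunk.", metadata={"source": "notes.txt", "chunk": 1})

doc_a.metadata["source"] = "readme.md"  # affects doc_a only
```

Using `None` as the default and building the dict inside `__init__` is the usual way to avoid the shared-mutable-default pitfall in Python.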

class DocumentLoader[source]

Load documents from various sources.

__init__()[source]

load_directory(path)[source]

Load all supported documents from a directory recursively.

Parameters:

path (str | Path) – Directory path to load documents from

Return type:

list[Document]

Returns:

List of Document objects containing file contents and metadata

Raises:

FileNotFoundError – If directory does not exist
NotADirectoryError – If path is not a directory
ValueError – If no supported documents are found
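
The error conditions listed above can be sketched with `pathlib`. This is an illustration under stated assumptions rather than the loader's actual code: the function name `load_directory_sketch`, the set of supported extensions, and the use of plain dicts in place of Document objects are all hypothetical.

```python
import tempfile
from pathlib import Path

# Assumption: this reference does not list the supported file types.
SUPPORTED_SUFFIXES = {".txt", ".md"}


def load_directory_sketch(path):
    """Recursively collect supported files, raising the documented errors."""
    root = Path(path)
    if not root.exists():
        raise FileNotFoundError(f"Directory does not exist: {root}")
    if not root.is_dir():
        raise NotADirectoryError(f"Not a directory: {root}")

    documents = []
    # rglob("*") walks the directory tree recursively.
    for file_path in sorted(root.rglob("*")):
        if file_path.is_file() and file_path.suffix.lower() in SUPPORTED_SUFFIXES:
            text = file_path.read_text(encoding="utf-8")
            # Plain dicts stand in for Document objects in this sketch.
            documents.append({"content": text, "metadata": {"source": str(file_path)}})

    if not documents:
        raise ValueError(f"No supported documents found in {root}")
    return documents


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.txt").write_text("alpha", encoding="utf-8")
    (Path(tmp) / "sub" / "b.md").write_text("beta", encoding="utf-8")
    (Path(tmp) / "ignored.bin").write_bytes(b"\x00")
    loaded = load_directory_sketch(tmp)
```

Checking existence before checking the path type keeps the two documented exceptions distinct: a missing path raises FileNotFoundError, while an existing file raises NotADirectoryError.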

class VectorStore(dimension, index_path=None, use_gpu=True)[source]

Vector store for document embeddings using FAISS.

This class implements a vector store that uses FAISS for efficient similarity search of document embeddings. It supports:

- GPU acceleration when available
- Persistence to disk
- Document metadata management
- IVF (Inverted File) index for faster search

Parameters:

dimension (int)
index_path (Optional[str])
use_gpu (bool)

__init__(dimension, index_path=None, use_gpu=True)[source]

Initialize the vector store.

Parameters:

dimension (int) – Dimension of the embedding vectors
index_path (Optional[str]) – Path to load/save the index and metadata. If None, the store is kept in memory only
use_gpu (bool) – Whether to use GPU for FAISS operations. Falls back to CPU if GPU is not available

Raises:

ValueError – If dimension is invalid
RuntimeError – If GPU initialization fails

add_documents(documents, embeddings)[source]

Add documents and their embeddings to the store.

Parameters:

documents (List[Document]) – List of Document objects to add
embeddings (ndarray) – numpy array of document embeddings with shape (n_docs, dimension)

Raises:

ValueError – If number of documents doesn’t match number of embeddings, or if embedding dimensions don’t match

Return type:

None
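
The two ValueError conditions above amount to a shape check on the incoming batch. The following is a minimal sketch of that check, assuming nested Python lists in place of a numpy array so that it runs without extra dependencies. The helper names `validate_batch` and `is_rejected` are hypothetical and are not part of docsense.

```python
def validate_batch(documents, embeddings, dimension):
    """Raise ValueError under the two conditions documented for add_documents."""
    if len(documents) != len(embeddings):
        raise ValueError(
            f"Got {len(documents)} documents but {len(embeddings)} embeddings"
        )
    for i, row in enumerate(embeddings):
        if len(row) != dimension:
            raise ValueError(
                f"Embedding {i} has dimension {len(row)}, expected {dimension}"
            )


def is_rejected(documents, embeddings, dimension):
    """Return True if the batch fails validation (helper for the demo)."""
    try:
        validate_batch(documents, embeddings, dimension)
    except ValueError:
        return True
    return False


docs = ["doc one", "doc two"]
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

ok = not is_rejected(docs, vectors, dimension=3)        # shapes agree
count_mismatch = is_rejected(docs, vectors[:1], 3)      # 2 docs, 1 embedding
dim_mismatch = is_rejected(docs, vectors, dimension=4)  # rows have 3 values
```

With a numpy array, the same check would typically compare `embeddings.shape` against `(len(documents), dimension)`.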

search(query_embedding, k=2)[source]

Search for most similar documents using the query embedding.

Parameters:

query_embedding (ndarray) – Query vector with shape (dimension,) or (1, dimension)
k (int) – Number of results to return

Return type:

List[Tuple[Document, float]]

Returns:

List of (document, distance) tuples sorted by similarity (closest first)

Raises:

ValueError – If query_embedding has invalid shape
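
The return contract above, (document, distance) tuples with the closest first, can be illustrated with a brute-force search in pure Python. This is a stand-in for the FAISS-backed search, not a description of it: Euclidean distance is an assumption, since this reference does not state the metric, and the names `flatten_query` and `search_sketch` are hypothetical.

```python
import math


def flatten_query(query_embedding, dimension):
    """Accept shape (dimension,) or (1, dimension); reject anything else."""
    if len(query_embedding) > 0 and isinstance(query_embedding[0], (list, tuple)):
        if len(query_embedding) != 1:
            raise ValueError("Expected exactly one query vector")
        query_embedding = query_embedding[0]
    if len(query_embedding) != dimension:
        raise ValueError(
            f"Query has dimension {len(query_embedding)}, expected {dimension}"
        )
    return list(query_embedding)


def search_sketch(entries, query_embedding, k=2):
    """Return the k nearest (document, distance) pairs, closest first.

    `entries` is a list of (document, vector) pairs. Euclidean distance is
    an assumption made for this sketch.
    """
    dimension = len(entries[0][1])
    query = flatten_query(query_embedding, dimension)
    scored = [(doc, math.dist(query, vec)) for doc, vec in entries]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


entries = [
    ("apples", [1.0, 0.0]),
    ("bananas", [0.0, 1.0]),
    ("cherries", [0.9, 0.1]),
]
results = search_sketch(entries, [[1.0, 0.0]], k=2)  # (1, dimension) form
same_results = search_sketch(entries, [1.0, 0.0], k=2)  # (dimension,) form
```

Both query forms produce the same ranking because the (1, dimension) input is flattened before any distance is computed.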

save()[source]

Save the index and metadata to disk.

This method saves both the FAISS index and document metadata to the specified index path. The index is saved in FAISS binary format and metadata in JSON.

Raises:

ValueError – If no index path was specified
IOError – If saving fails

Return type:

None

load()[source]

Load the index and metadata from disk.

This method loads both the FAISS index and document metadata from the specified index path. The index is loaded from FAISS binary format and metadata from JSON.

Raises:

ValueError – If no index path was specified, or if there is a dimension mismatch
FileNotFoundError – If index or metadata files are missing
IOError – If loading fails

Return type:

None
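
The JSON half of the save/load cycle can be sketched as a simple round trip. This is only an illustration under stated assumptions: the file name `metadata.json`, the record layout, and the function names are hypothetical, and the FAISS binary index that the real methods also write and read is left out.

```python
import json
import tempfile
from pathlib import Path

# Assumption: the actual file name used by the store is not documented here.
METADATA_FILE = "metadata.json"


def save_metadata(index_path, records):
    """Write document records as JSON under index_path."""
    if index_path is None:
        raise ValueError("No index path was specified")
    target = Path(index_path)
    target.mkdir(parents=True, exist_ok=True)
    (target / METADATA_FILE).write_text(
        json.dumps(records, indent=2), encoding="utf-8"
    )


def load_metadata(index_path):
    """Read document records back from JSON under index_path."""
    if index_path is None:
        raise ValueError("No index path was specified")
    metadata_file = Path(index_path) / METADATA_FILE
    if not metadata_file.exists():
        raise FileNotFoundError(f"Missing metadata file: {metadata_file}")
    return json.loads(metadata_file.read_text(encoding="utf-8"))


# Hypothetical record layout: one dict per document.
records = [{"content": "hello", "metadata": {"source": "a.txt"}}]

with tempfile.TemporaryDirectory() as tmp:
    save_metadata(tmp, records)
    restored = load_metadata(tmp)
```

Because JSON can only represent basic types, metadata values such as datetimes or paths would need to be converted to strings before saving in a scheme like this one.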

clear()[source]

Clear all documents and reset the index.

This method removes all documents and their embeddings from the store, effectively resetting it to its initial state. If persistence is enabled, the cleared state will be saved to disk.

Return type:

None

Modules

`document`	Document processing module for loading and chunking documents.
`document_loader`	Document loader implementation.
`vector_store`	Vector store implementation using FAISS.