docsense.indexer.document

Document processing module for loading and chunking documents.

This module provides functionality for representing documents and splitting them into manageable chunks while preserving metadata. It includes classes for document representation and text chunking with configurable overlap.

Classes

Document(content[, metadata])

Represents a document or a chunk of a document.

DocumentChunker([chunk_size, chunk_overlap])

Splits documents into smaller chunks with configurable overlap.

class Document(content, metadata=None)

Represents a document or a chunk of a document.

A Document object holds the text content and its associated metadata. The metadata can include information such as the source, timestamps, and chunk positions.

Parameters:
  • content (str)

  • metadata (Optional[dict])

__init__(content, metadata=None)

Initialize a Document object.

Parameters:
  • content (str) – The text content of the document

  • metadata (Optional[dict]) – Optional dictionary containing metadata about the document. If None, an empty dict will be used.
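The defaulting rule for metadata can be illustrated with a small stand-in. This is a minimal sketch, not the library's source: the class below only mirrors the documented signature and the documented "None becomes an empty dict" behavior, and everything else about the real Document is left unspecified.

```python
from typing import Optional


class Document:
    """Stand-in that mirrors the documented signature (an assumption, not the real class)."""

    def __init__(self, content: str, metadata: Optional[dict] = None):
        self.content = content
        # Documented behavior: when metadata is None, an empty dict is used.
        # A fresh dict is created per instance, so instances never share state.
        self.metadata = metadata if metadata is not None else {}


plain = Document("Hello world.")
tagged = Document("Hello world.", {"source": "notes.txt"})
```

Using None as the default and creating the dict inside __init__ avoids the usual Python pitfall of a mutable default argument being shared across calls.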

content: str

metadata: dict

class DocumentChunker(chunk_size=1000, chunk_overlap=200)

Splits documents into smaller chunks with configurable overlap.

This class provides functionality to split large text documents into smaller, overlapping chunks while preserving document metadata. It attempts to split at sentence boundaries to maintain context.

Parameters:
  • chunk_size (int)

  • chunk_overlap (int)

__init__(chunk_size=1000, chunk_overlap=200)

Initialize the chunker.

Parameters:
  • chunk_size (int) – Maximum number of characters per chunk. Default is 1000.

  • chunk_overlap (int) – Number of characters to overlap between consecutive chunks to maintain context. Default is 200.
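To see how the two parameters interact, it can help to work out the arithmetic for plain fixed-size windows. The helper below is hypothetical and not part of docsense; it assumes each chunk starts chunk_size - chunk_overlap characters after the previous one. The real chunker adjusts cut points to sentence boundaries, so its actual chunk count may be higher. Whether the library itself rejects an overlap that is not smaller than the chunk size is not documented; the check here is an assumption of this sketch.

```python
def estimate_chunk_count(text_length: int, chunk_size: int = 1000,
                         chunk_overlap: int = 200) -> int:
    """Estimate the number of fixed-size windows needed to cover text_length characters."""
    # Assumption of this sketch: the overlap must be smaller than the chunk size,
    # otherwise the window would never move forward.
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # 800 with the documented defaults
    remaining = text_length - chunk_size
    # One window for the first chunk_size characters, then one per stride (rounded up).
    return 1 + (remaining + stride - 1) // stride


# With the defaults, 2600 characters fit in three windows:
# [0, 1000), [800, 1800), [1600, 2600)
windows_needed = estimate_chunk_count(2600)
```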

split(text, metadata)

Split text into overlapping chunks while preserving metadata.

The method attempts to split at sentence boundaries to maintain readability and context. Each chunk inherits the original document’s metadata with additional chunk position information.

Parameters:
  • text (str) – Text content to split into chunks

  • metadata (dict) – Metadata to attach to each chunk. Each chunk's copy is extended with chunk-specific position information.

Return type:

List[Document]

Returns:

List of Document objects, each containing a chunk of the original text and associated metadata.
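The behavior described for split can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: the function name split_sketch, the ". " sentence heuristic, and the metadata keys chunk_index, start_char, and end_char are all hypothetical, and the Document class is a stand-in that mirrors the documented signature.

```python
from typing import List, Optional


class Document:
    """Stand-in for the documented Document class (an assumption)."""

    def __init__(self, content: str, metadata: Optional[dict] = None):
        self.content = content
        self.metadata = metadata if metadata is not None else {}


def split_sketch(text: str, metadata: dict, chunk_size: int = 1000,
                 chunk_overlap: int = 200) -> List[Document]:
    """Split text into overlapping chunks, preferring to cut after a sentence."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    chunks: List[Document] = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Look for the last ". " inside the window and cut just after the period.
            boundary = text.rfind(". ", start, end)
            # Accept it only if the chunk stays longer than the overlap,
            # which guarantees the next start moves forward.
            if boundary != -1 and boundary + 1 - start > chunk_overlap:
                end = boundary + 1
        # Copy the caller's metadata so the original dict is never mutated,
        # then add position keys (hypothetical names).
        chunk_meta = dict(metadata)
        chunk_meta.update({"chunk_index": len(chunks),
                           "start_char": start,
                           "end_char": end})
        chunks.append(Document(text[start:end], chunk_meta))
        if end == len(text):
            break
        start = end - chunk_overlap  # step back to create the overlap
    return chunks


sample = "abcdefghi. " * 10
pieces = split_sketch(sample, {"source": "a.txt"}, chunk_size=40, chunk_overlap=10)
```

Each returned chunk carries the caller's metadata plus its own position, so a chunk can be traced back to the exact slice of the original text.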