docsense.indexer.document
Document processing module for loading and chunking documents.
This module provides functionality for representing documents and splitting them into manageable chunks while preserving metadata. It includes classes for document representation and text chunking with configurable overlap.
Classes
Document: Represents a document or a chunk of a document.
DocumentChunker: Splits documents into smaller chunks with configurable overlap.
- class Document(content, metadata=None)
Represents a document or a chunk of a document.
A Document object contains the actual text content and associated metadata. The metadata can include information like source, timestamps, chunk positions, etc.
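To make the content-plus-metadata shape concrete, here is a minimal stand-in written for illustration. It is not the library's source: the name `SketchDocument`, the dataclass layout, and the choice to turn a missing `metadata` into an empty dict are all assumptions. Only the `content` and `metadata=None` arguments come from the signature above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for Document: text plus free-form metadata."""

    content: str
    metadata: Optional[Dict[str, Any]] = None

    def __post_init__(self) -> None:
        # Assumption: a missing metadata argument becomes an empty dict,
        # so callers can always read keys without a None check.
        if self.metadata is None:
            self.metadata = {}


# Metadata keys such as "source" are illustrative, not a fixed schema.
doc = SketchDocument("Hello world.", {"source": "notes.txt"})
bare = SketchDocument("No metadata supplied.")
```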
- class DocumentChunker(chunk_size=1000, chunk_overlap=200)
Splits documents into smaller chunks with configurable overlap.
This class provides functionality to split large text documents into smaller, overlapping chunks while preserving document metadata. It attempts to split at sentence boundaries to maintain context.
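The relationship between `chunk_size` and `chunk_overlap` is easiest to see with a plain fixed-width sliding window. The sketch below assumes both values are measured in characters and ignores sentence boundaries entirely, so it only approximates what the class does; `window_starts` is a hypothetical helper, not part of the module.

```python
from typing import List


def window_starts(text_length: int, chunk_size: int = 1000,
                  chunk_overlap: int = 200) -> List[int]:
    """Return start offsets of fixed-width windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window advances by this much
    starts = []
    start = 0
    while start < text_length:
        starts.append(start)
        if start + chunk_size >= text_length:
            break  # this window already reaches the end of the text
        start += step
    return starts


# With the default values, each window advances 800 characters.
offsets = window_starts(2500)
```

Under these assumptions a 2500-character text yields windows starting at offsets 0, 800, and 1600, and each adjacent pair shares 200 characters.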
- split(text, metadata)
Split text into overlapping chunks while preserving metadata.
The method attempts to split at sentence boundaries to maintain readability and context. Each chunk inherits the original document’s metadata with additional chunk position information.
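- Parameters:
  - text – The text to split into chunks.
  - metadata – Metadata of the original document, inherited by every chunk.
- Return type:
  list of Document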
- Returns:
List of Document objects, each containing a chunk of the original text and associated metadata.
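The behavior described for `split` can be sketched as a greedy sentence packer. This is one possible approach under stated assumptions, not the library's implementation: sizes are counted in characters, sentences are detected with a simple punctuation regex, and the names `SketchDocument`, `sketch_split`, `chunk_index`, and `chunk_count` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for Document."""
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def sketch_split(text: str, metadata: Optional[Dict[str, Any]] = None,
                 chunk_size: int = 1000,
                 chunk_overlap: int = 200) -> List[SketchDocument]:
    # Naive sentence detection: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward while they fit in the overlap.
            carried: List[str] = []
            for prev in reversed(current):
                if len(" ".join([prev] + carried)) > chunk_overlap:
                    break
                carried.insert(0, prev)
            current = carried
        # A single sentence longer than chunk_size is kept whole.
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    base = dict(metadata or {})
    # Each chunk gets a copy of the original metadata plus its position.
    return [
        SketchDocument(chunk, {**base, "chunk_index": i, "chunk_count": len(chunks)})
        for i, chunk in enumerate(chunks)
    ]


text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
docs = sketch_split(text, {"source": "demo"}, chunk_size=32, chunk_overlap=16)
```

In this example the second chunk begins with "Four five six.", the sentence that closed the first chunk, which is the overlap at work. The third chunk starts fresh because the preceding sentence is longer than the 16-character overlap budget.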