
Semantic Retrieval Engine: Contextual Understanding

Parallel to the lexical system, the semantic engine operates using transformer-generated dense embeddings. Each document is segmented into logical chunks—often aligned to legal sections or structural hierarchies—and transformed into a 384-dimensional embedding vector using a domain-aligned sentence-transformer model.
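As a minimal sketch of the chunking step, a section-aligned splitter might look like the following. The function name and the heading pattern are assumptions for illustration, not the system's actual implementation:

```python
import re

def chunk_by_section(document: str) -> list[str]:
    # Hypothetical splitter: treat headings like "Section 12:" as chunk
    # boundaries, approximating the legal-section alignment described above.
    parts = re.split(r"(?=Section \d+:)", document)
    return [p.strip() for p in parts if p.strip()]

doc = (
    "Section 1: Definitions. Terms used in this Act...\n"
    "Section 2: Scope. This Act applies to...\n"
)
chunks = chunk_by_section(doc)
# Each resulting chunk is then embedded into a 384-dimensional vector.
```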

The embedding transformation is formally described as:

f : \text{Text} \rightarrow \mathbb{R}^{384}

These vectors encode each chunk's meaning as a position in a high-dimensional space, so conceptually related passages map to nearby points. At query time, the incoming query is embedded into the same vector space, and similarity is measured using cosine similarity:

\mathrm{CosSim}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
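Concretely, the formula above can be computed with NumPy. This is a standalone sketch with illustrative vectors, not the engine's code:

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    # CosSim(A, B) = (A . B) / (||A|| * ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a

cos_sim(a, b)  # parallel vectors -> 1.0
cos_sim(a, c)  # orthogonal vectors -> 0.0
```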

The engine then retrieves the nearest neighbors from a vector database using Approximate Nearest Neighbor (ANN) search, implemented via Qdrant’s HNSW index.
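The retrieval step can be sketched as a brute-force nearest-neighbor search in NumPy. This computes exactly what the HNSW index approximates at scale; the Qdrant-specific API is omitted, and the array names here are illustrative:

```python
import numpy as np

def top_k_neighbors(query: np.ndarray, index: np.ndarray, k: int = 3) -> list[int]:
    # With unit-normalized vectors, the dot product equals cosine similarity,
    # so ranking by dot product ranks by CosSim.
    query = query / np.linalg.norm(query)
    index = index / np.linalg.norm(index, axis=1, keepdims=True)
    scores = index @ query
    return np.argsort(scores)[::-1][:k].tolist()

rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 384))              # stand-in for stored chunk embeddings
query = corpus[42] + 0.01 * rng.normal(size=384)  # near-duplicate of chunk 42

top_k_neighbors(query, corpus)  # chunk 42 should rank first
```

An ANN index such as HNSW avoids scoring every stored vector, trading a small amount of recall for sub-linear query time.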

Semantic retrieval solves vocabulary mismatch by identifying conceptually similar passages even when explicit keywords differ. This is particularly valuable in interpretive legal queries, policy analysis, or conceptual searches.

Embedding generation is optimized via batch processing, CUDA acceleration where available, and gradient-free inference using torch.no_grad() to minimize computational overhead.

Example in Practice: Embedding Generation

import torch

# Using sentence-transformers within the NepaliProcessor
# (CUDA is used automatically if the model was loaded onto a GPU)
with torch.no_grad():
    embeddings = self._embedding_model.encode(
        text_chunks,
        convert_to_tensor=True,
        show_progress_bar=False,
        normalize_embeddings=True,  # unit-length vectors, so dot product == cosine similarity
    )

# Efficient transfer to CPU and conversion to plain Python lists
vectorized_results = embeddings.cpu().numpy().tolist()