Lexical Retrieval Engine: Structural Precision

The lexical engine is built upon an inverted indexing framework enhanced by a hot/cold storage architecture. During indexing, documents are tokenized into lexemes, normalized for Unicode consistency, and processed through language-aware stemming pipelines capable of handling both Devanagari and Latin scripts.

Each term is inserted into a structured TSVector representation, which records:

  • Exact frequency within the document
  • Positional offsets
  • Document length metadata

This design supports more than simple keyword matching: it enables phrase proximity ranking, positional boosting, and precise citation resolution.
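As an illustration of why positional offsets matter, the toy index below records, for every term, the positions at which it occurs in each document, and uses adjacency of positions to answer phrase queries. This is a simplified stand-in for the engine's TSVector layout, not its actual implementation; the function names are hypothetical.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Toy positional inverted index: term -> {doc_id: [positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, tok in enumerate(text.lower().split()):
            index[tok][doc_id].append(pos)
    return index

def phrase_match(index, doc_id, first, second):
    """True if `second` occurs immediately after `first` in doc_id."""
    p1 = index.get(first, {}).get(doc_id, [])
    p2 = set(index.get(second, {}).get(doc_id, []))
    return any(p + 1 in p2 for p in p1)
```

With per-term positions available, a phrase query reduces to checking that two posting lists contain consecutive offsets, which is also the basis for proximity-weighted ranking.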

TF-IDF / BM25 Scoring

Relevance scoring is computed using TF-IDF or, optionally, BM25, governed by the formulation:

BM25(d,q) = \sum_{t \in q} IDF(t) \cdot \frac{f(t,d)\,(k+1)}{f(t,d) + k\left(1 - b + b \cdot \frac{|d|}{avgdl}\right)}

where the term-saturation parameter k and the document-length normalization parameter b are tunable to reflect the nature of legal text, which is often verbose and hierarchical.
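A minimal sketch of the scoring formula above, with the common probabilistic IDF; the function name and the `doc_freqs`/`n_docs`/`avgdl` inputs are illustrative assumptions, not the engine's API.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avgdl, k=1.2, b=0.75):
    """Score one tokenized document against a query with BM25.

    doc_freqs: term -> number of documents containing it
    n_docs:    corpus size; avgdl: average document length in tokens
    """
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        df = doc_freqs.get(t, 0)
        if df == 0:
            continue  # term absent from the corpus contributes nothing
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = 1 - b + b * len(doc_terms) / avgdl  # length normalization
        score += idf * tf[t] * (k + 1) / (tf[t] + k * norm)
    return score
```

Raising k delays term-frequency saturation; raising b penalizes long documents more aggressively, which matters for verbose statutory text.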

Hot/Cold Storage Tiers

The lexical index is partitioned into two tiers:

  • A Hot Store, maintained in memory for rapid lookup and high-frequency terms.
  • A Cold Store, persisted on disk using a B-Tree implementation to ensure logarithmic-time lookups and stable range querying.

The B-Tree design was selected due to its disk-optimized node structure, predictable O(log n) search complexity, and stability under high-volume indexing. A Least Frequently Used (LFU) eviction strategy asynchronously flushes memory-resident terms into the cold store when thresholds are reached, preventing memory overflow while preserving performance.
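The eviction flow can be sketched as follows. This is a hypothetical, synchronous simplification: a plain dict stands in for the disk-resident B-Tree, and the class and method names are invented for illustration (the real flush is asynchronous).

```python
from collections import defaultdict

class HotColdIndex:
    """Sketch of an LFU-evicting hot store backed by a cold store."""

    def __init__(self, hot_capacity):
        self.hot_capacity = hot_capacity
        self.hot = {}                 # in-memory term -> postings
        self.freq = defaultdict(int)  # access counts for LFU
        self.cold = {}                # stand-in for the on-disk B-Tree

    def insert(self, term, postings):
        if term not in self.hot and len(self.hot) >= self.hot_capacity:
            self._evict_lfu()
        self.hot[term] = postings
        self.freq[term] += 1

    def lookup(self, term):
        if term in self.hot:
            self.freq[term] += 1
            return self.hot[term]
        return self.cold.get(term)    # fall back to the cold tier

    def _evict_lfu(self):
        victim = min(self.hot, key=lambda t: self.freq[t])
        self.cold[victim] = self.hot.pop(victim)  # flush to cold store
        del self.freq[victim]
```

LFU (rather than LRU) fits this workload because a small set of legal terms is queried persistently; frequency, not recency, predicts which postings belong in memory.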

Example in Practice: Tokenization and TSVector Creation

from nep_search.textprocessing import NepaliProcessor
from nep_search.fulltext import TSVector

processor = NepaliProcessor()

# 1. Cleaning and tokenizing mixed-language/Nepali text
raw_text = "नेपालको संविधान २०७२, Article 1"
cleaned_text = processor._clean_text(raw_text)
tokens = processor._tokenize_text(cleaned_text)
# Output filters stopwords/punctuation, normalizes Unicode

# 2. Creating a TSVector for the lexical inverted index
ts_vector = TSVector.from_lexeme_list(tokens)
# Captures lexeme positional data for phrase ranking / proximities