Lexical Retrieval Engine: Structural Precision
The lexical engine is built on an inverted index backed by a hot/cold storage architecture. During indexing, documents are tokenized into lexemes, normalized for Unicode consistency, and processed through language-aware stemming pipelines that handle both Devanagari and Latin scripts.
Each term is inserted into a structured TSVector representation, which records:
- Exact frequency within the document
- Positional offsets
- Document length metadata
This design supports more than simple keyword matching: it enables phrase-proximity ranking, positional boosting, and precise citation resolution.
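The per-document record described above can be sketched as a small positional posting structure. This is an illustrative model, not the engine's actual internals; the names `Posting`, `add_occurrence`, and `phrase_distance` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Posting:
    """Hypothetical per-document index entry: term frequencies,
    positional offsets, and document-length metadata."""
    doc_id: int
    doc_length: int  # document length metadata, used for normalization
    positions: dict[str, list[int]] = field(default_factory=dict)

    def add_occurrence(self, term: str, offset: int) -> None:
        # Record a positional offset for the term in this document.
        self.positions.setdefault(term, []).append(offset)

    def term_frequency(self, term: str) -> int:
        # Exact frequency within the document.
        return len(self.positions.get(term, []))

    def phrase_distance(self, a: str, b: str):
        """Minimum positional gap between two terms; small gaps
        feed phrase-proximity ranking."""
        pa, pb = self.positions.get(a), self.positions.get(b)
        if not pa or not pb:
            return None
        return min(abs(i - j) for i in pa for j in pb)

p = Posting(doc_id=1, doc_length=6)
for i, tok in enumerate("the constitution of nepal article one".split()):
    p.add_occurrence(tok, i)
```

Because every occurrence keeps its offset, proximity can be computed at query time instead of being baked into the index.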
TF-IDF / BM25 Scoring
Relevance scoring is computed using TF-IDF or, optionally, BM25, governed by the formulation:

$$\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}$$

where the term-saturation parameter $k_1$ and the document-length normalization parameter $b$ are tunable to reflect the nature of legal text, which is often verbose and hierarchical.
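The per-term scoring can be sketched as follows, assuming standard BM25 with its usual parameters: `k1` controls term saturation and `b` controls document-length normalization. The function name and defaults are illustrative, not the engine's API.

```python
import math

def bm25_score(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """BM25 contribution of a single query term.

    tf: term frequency in the document
    df: number of documents containing the term
    k1: term-saturation knob; b: length-normalization knob.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Verbose legal documents run long; lowering b dampens the length penalty.
score = bm25_score(tf=4, df=12, doc_len=2400, avg_doc_len=800, n_docs=10_000)
```

Summing this value over all query terms yields the document's relevance score; tuning `b` downward keeps long statutes from being unfairly penalized relative to short notices.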
Hot/Cold Storage Tiers
The lexical index is partitioned into two tiers:
- A Hot Store, held in memory for rapid lookup of high-frequency terms.
- A Cold Store, persisted on disk using a B-Tree implementation to ensure logarithmic-time lookups and stable range querying.
The B-Tree design was selected due to its disk-optimized node structure, predictable O(log n) search complexity, and stability under high-volume indexing. A Least Frequently Used (LFU) eviction strategy asynchronously flushes memory-resident terms into the cold store when thresholds are reached, preventing memory overflow while preserving performance.
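The tiering and eviction behavior can be sketched as follows. This is a minimal in-process model: the cold tier is stubbed as a dict standing in for the disk-resident B-Tree, eviction runs synchronously rather than asynchronously, and the `TieredTermStore` name and methods are hypothetical.

```python
from collections import Counter

class TieredTermStore:
    """Sketch of the hot/cold split with LFU eviction."""

    def __init__(self, hot_capacity: int):
        self.hot_capacity = hot_capacity
        self.hot: dict[str, list[int]] = {}   # in-memory tier
        self.cold: dict[str, list[int]] = {}  # stand-in for the on-disk B-Tree
        self.access_counts: Counter = Counter()

    def get(self, term):
        # Hot-tier hits bump the term's frequency count.
        if term in self.hot:
            self.access_counts[term] += 1
            return self.hot[term]
        return self.cold.get(term)

    def put(self, term, postings):
        self.hot[term] = postings
        self.access_counts[term] += 1
        if len(self.hot) > self.hot_capacity:
            self._evict_lfu()

    def _evict_lfu(self):
        # Flush the least frequently used hot term to the cold tier,
        # keeping memory bounded while frequent terms stay resident.
        victim = min(self.hot, key=lambda t: self.access_counts[t])
        self.cold[victim] = self.hot.pop(victim)

store = TieredTermStore(hot_capacity=2)
store.put("samvidhan", [1, 4])
store.put("dhara", [2])
store.get("samvidhan")      # bump its access count
store.put("adhikar", [7])   # exceeds capacity; "dhara" is evicted to cold
```

Reads fall through to the cold tier transparently, so eviction changes latency but never correctness.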
Example in Practice: Tokenization and TSVector Creation
```python
from nep_search.textprocessing import NepaliProcessor
from nep_search.fulltext import TSVector

processor = NepaliProcessor()

# 1. Clean and tokenize mixed Nepali/English text
raw_text = "नेपालको संविधान २०७२, Article 1"
cleaned_text = processor._clean_text(raw_text)
tokens = processor._tokenize_text(cleaned_text)
# Output filters stopwords/punctuation and normalizes Unicode

# 2. Create a TSVector for the lexical inverted index
ts_vector = TSVector.from_lexeme_list(tokens)
# Captures lexeme positional data for phrase ranking / proximity scoring
```