Documentation - NEUROVEDIK

🌟 Overview

NEUROVEDIK is a high-performance arbitrary-precision arithmetic library that achieves a 2x speedup over GMP for cryptographic workloads by combining Vedic multiplication and division algorithms with modern SIMD hardware.

Key Innovations

Vedic Urdhva Tiryakbhyam: Cross-multiplication naturally parallelizes
AVX-512 SIMD: Process 8 × 64-bit integers simultaneously
Smart Dispatch: Auto-select optimal algorithm per input size
Constant-Time: Side-channel resistant crypto operations

🧮 Vedic Algorithms

Urdhva Tiryakbhyam (Vertically and Crosswise)

The Urdhva Tiryakbhyam sutra is an ancient Vedic multiplication method that computes the partial-product sum of each result column independently. Unlike traditional schoolbook multiplication, which processes the operands row by row, this method works column by column.

Algorithm Visualization

     A = [a₃ a₂ a₁ a₀]
   × B = [b₃ b₂ b₁ b₀]
   ─────────────────────
   
   Column 0: a₀×b₀
   Column 1: a₀×b₁ + a₁×b₀
   Column 2: a₀×b₂ + a₁×b₁ + a₂×b₀
   Column 3: a₀×b₃ + a₁×b₂ + a₂×b₁ + a₃×b₀
   Column 4: a₁×b₃ + a₂×b₂ + a₃×b₁
   Column 5: a₂×b₃ + a₃×b₂
   Column 6: a₃×b₃

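Why this matters: The column sums do not depend on one another, so they can be computed in parallel; only the final carry propagation between columns is sequential. This maps well to SIMD architectures, where the products for a column can be computed with a single packed multiply instruction.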
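The column scheme above can be sketched in a few lines of Python. This is an illustrative model only, not NEUROVEDIK's implementation: the function names (`column_sums`, `urdhva_multiply`, `to_limbs`, `from_limbs`) and the little-endian limb-list representation are assumptions made for the example.

```python
def column_sums(a, b):
    """Raw column sums of a*b for little-endian limb lists a and b."""
    cols = [0] * (len(a) + len(b) - 1)
    for k in range(len(cols)):
        # Column k collects every product a[i] * b[j] with i + j == k.
        lo = max(0, k - len(b) + 1)
        hi = min(k, len(a) - 1)
        cols[k] = sum(a[i] * b[k - i] for i in range(lo, hi + 1))
    return cols


def urdhva_multiply(a, b, base=10):
    """Multiply two little-endian limb lists; returns little-endian limbs."""
    result = []
    carry = 0
    for col in column_sums(a, b):  # each column sum is independent of the others
        total = col + carry        # carry propagation is the only sequential step
        result.append(total % base)
        carry = total // base
    while carry:
        result.append(carry % base)
        carry //= base
    return result


def to_limbs(n, base=10):
    """Split a non-negative integer into little-endian limbs."""
    limbs = []
    while True:
        limbs.append(n % base)
        n //= base
        if n == 0:
            return limbs


def from_limbs(limbs, base=10):
    """Recombine little-endian limbs into an integer."""
    return sum(d * base**i for i, d in enumerate(limbs))
```

For example, `from_limbs(urdhva_multiply(to_limbs(1234), to_limbs(5678)))` reproduces `1234 * 5678`. Passing `base=2**64` models 64-bit limbs; a hardware version would also have to handle column sums that overflow a machine word.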

Nikhilam Sutra (Division)

The Nikhilam sutra provides fast division when the divisor is slightly below a power of 10 (or a power of 2 in binary). It replaces division with multiplications by the divisor's small complement, followed by additions.

Example

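Divide 10004 by 98:
  98 is 100 - 2, so complement = 2
  
  10004 → Split: 100 | 04          (quotient so far: 100)
  
  100 × 2 = 200
  04 + 200 = 204 → Split: 2 | 04   (quotient so far: 100 + 2 = 102)
  
  2 × 2 = 4
  04 + 4 = 08                      (less than 98, so stop)
  
  Result: 102 remainder 08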
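The split-multiply-add steps in the example can be written as a loop. The sketch below is a minimal model under stated assumptions: `nikhilam_divmod` and its parameter names are hypothetical, and `divmod(rest, base_power)` stands in for what would be a digit split (or a shift and mask in binary).

```python
def nikhilam_divmod(dividend, divisor, base_power):
    """Divide by a divisor slightly below base_power (e.g. 98 with base_power 100).

    Returns (quotient, remainder). Assumes dividend >= 0 and 0 < divisor <= base_power.
    """
    if dividend < 0 or not (0 < divisor <= base_power):
        raise ValueError("expected dividend >= 0 and 0 < divisor <= base_power")
    complement = base_power - divisor
    quotient = 0
    rest = dividend
    while rest >= base_power:
        # Split at the power of the base: rest = high * base_power + low.
        high, low = divmod(rest, base_power)
        quotient += high
        # high * base_power = high * divisor + high * complement,
        # so only the small product high * complement is carried forward.
        rest = low + high * complement
    while rest >= divisor:  # final correction, a handful of subtractions at most
        rest -= divisor     # when the divisor is close to base_power
        quotient += 1
    return quotient, rest
```

Calling `nikhilam_divmod(10004, 98, 100)` follows the same intermediate values as the worked example (204, then 8). The loop shrinks `rest` quickly only when the complement is small relative to the base power, which is why the method suits divisors near a power of the base.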

⚡ SIMD Acceleration

AVX-512 Implementation

NEUROVEDIK uses AVX-512 instructions to process 8 × 64-bit limbs simultaneously: each 512-bit ZMM register holds eight 64-bit lanes, so a single packed instruction operates on all eight at once.

SIMD Level	Register Size	64-bit Limbs	Speedup
Scalar	64 bits	1	1x (baseline)
AVX2	256 bits	4	~2.5x
AVX-512	512 bits	8	~4x

Key Instructions Used

VPMULUDQ - Multiply packed unsigned doublewords (the low 32 bits of each 64-bit lane, producing 64-bit products)
VPADDQ - Add packed quadwords
VPERMQ - Permute quadwords for cross-multiplication
VMOVDQA64 - Aligned 64-byte load/store
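Because VPMULUDQ multiplies 32-bit values, a full 64 × 64-bit limb product has to be assembled from four 32 × 32-bit partial products. The Python model below illustrates that decomposition; it is a sketch, not the library's kernel, and the names `mul32x32`, `mul64x64`, and `mul_lanes` are made up for the example. Python integers do not overflow, so the carry handling a real SIMD kernel needs is hidden here.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mul32x32(x, y):
    """Model one VPMULUDQ lane: multiply the low 32 bits of two 64-bit values."""
    return (x & MASK32) * (y & MASK32)


def mul_lanes(xs, ys):
    """Model an 8-lane packed multiply over two sequences of eight 64-bit values."""
    if len(xs) != 8 or len(ys) != 8:
        raise ValueError("expected exactly 8 lanes per operand")
    return [mul32x32(x, y) for x, y in zip(xs, ys)]


def mul64x64(a, b):
    """Full 128-bit product of two 64-bit limbs, returned as (high, low) 64-bit halves."""
    a_lo, a_hi = a & MASK32, a >> 32
    b_lo, b_hi = b & MASK32, b >> 32
    ll = mul32x32(a_lo, b_lo)  # weight 2^0
    lh = mul32x32(a_lo, b_hi)  # weight 2^32
    hl = mul32x32(a_hi, b_lo)  # weight 2^32
    hh = mul32x32(a_hi, b_hi)  # weight 2^64
    product = ll + ((lh + hl) << 32) + (hh << 64)
    return product >> 64, product & MASK64
```

In this model the four partial products of one limb pair could occupy four lanes of a single packed multiply, with the shifts and additions done afterwards.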

🎯 Smart Dispatcher

The Smart Dispatcher automatically selects the optimal algorithm based on input size, available CPU features, and the specific operation being performed.

Input Size	Algorithm	Complexity
< 64 bits	Native CPU MUL	O(1)
64 - 2048 bits	Vedic + AVX-512	O(N²) SIMD
2048 - 32K bits	Karatsuba Optimized	O(N^1.58)
> 32K bits	NTT (Goldilocks)	O(N log N)

💡 Vedic Sweet Spot: The 64-2048 bit range covers 95%+ of cryptographic operations (RSA, ECC, signatures), making this the most impactful optimization.
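A size-based dispatcher over the ranges listed above might look like the following sketch. The function name, the returned labels, and the `has_avx512` flag are hypothetical. The source also leaves several points open, so the sketch assumes that exactly 2048 bits goes to the Vedic path, that "32K" means 32768 bits, and that a scalar Vedic fallback is used when AVX-512 is unavailable.

```python
def select_multiply_algorithm(bits, has_avx512=True):
    """Pick a multiplication strategy from the operand size in bits.

    Thresholds follow the documented ranges; boundary handling is an assumption.
    """
    if bits <= 0:
        raise ValueError("operand size must be positive")
    if bits < 64:
        return "native_mul"          # single CPU MUL instruction
    if bits <= 2048:
        # Assumption: fall back to a scalar Vedic path without AVX-512.
        return "vedic_avx512" if has_avx512 else "vedic_scalar"
    if bits <= 32 * 1024:
        return "karatsuba"
    return "ntt_goldilocks"
```

For instance, a 256-bit ECC field multiplication would select `"vedic_avx512"`, while a 4096-bit operand would select `"karatsuba"`.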

🔐 Cryptographic Operations

Montgomery Multiplication

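Montgomery form allows efficient modular multiplication without expensive division: with R chosen as a power of two larger than the odd modulus N, reduction needs only multiplications, shifts, and masks. Numbers are transformed into Montgomery form before a sequence of multiplications and transformed back afterwards.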

Transform to Montgomery: a' = a × R mod N

Montgomery Multiply: (a' × b' × R⁻¹) mod N

Transform back: result = a' × R⁻¹ mod N
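The three steps above can be sketched with the standard REDC reduction. This is a textbook-style illustration, not NEUROVEDIK's code: all function names are assumptions, `R = 2**r_bits`, and `pow(n, -1, r)` requires Python 3.8 or later. The final conditional subtraction is a branch, so this sketch is not constant-time.

```python
def montgomery_n_prime(n, r_bits):
    """Return n' with n * n' ≡ -1 (mod R), where R = 2**r_bits and n is odd, n < R."""
    r = 1 << r_bits
    if n % 2 == 0 or not (0 < n < r):
        raise ValueError("modulus must be odd and smaller than R")
    return (-pow(n, -1, r)) % r


def redc(t, n, r_bits, n_prime):
    """Montgomery reduction: t * R^-1 mod n, for 0 <= t < n * R."""
    mask = (1 << r_bits) - 1
    m = ((t & mask) * n_prime) & mask  # makes t + m*n divisible by R
    u = (t + m * n) >> r_bits          # exact division by R is a shift
    return u - n if u >= n else u      # branch shown for clarity only


def to_mont(a, n, r_bits):
    """a' = a * R mod n."""
    return (a << r_bits) % n


def mont_mul(a_m, b_m, n, r_bits, n_prime):
    """(a' * b' * R^-1) mod n, which is the Montgomery form of a * b."""
    return redc(a_m * b_m, n, r_bits, n_prime)


def from_mont(a_m, n, r_bits, n_prime):
    """a = a' * R^-1 mod n."""
    return redc(a_m, n, r_bits, n_prime)
```

The conversion cost pays off when many multiplications share one modulus, as in modular exponentiation, because only the first and last steps need a true reduction modulo N.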

Modular Exponentiation

NEUROVEDIK uses the Montgomery Ladder for constant-time modular exponentiation, critical for RSA and Diffie-Hellman.
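The structure of a Montgomery ladder can be sketched as follows. The names `cswap` and `ladder_pow` are illustrative, and plain `%` stands in for Montgomery multiplication. Python integers are not constant-time, so the sketch only shows the uniform sequence of operations (one multiply and one square per exponent bit, whatever the bit's value), not a side-channel-safe implementation.

```python
def cswap(bit, x, y):
    """Swap x and y when bit == 1, using masking instead of a branch."""
    mask = -bit              # 0 when bit == 0, all ones (-1) when bit == 1
    t = (x ^ y) & mask
    return x ^ t, y ^ t


def ladder_pow(base, exponent, modulus, exp_bits):
    """Compute base**exponent mod modulus, scanning exactly exp_bits bits.

    exp_bits is treated as public; the exponent must fit in exp_bits bits.
    """
    if exponent < 0 or exponent >> exp_bits:
        raise ValueError("exponent does not fit in exp_bits bits")
    r0, r1 = 1 % modulus, base % modulus   # invariant: r1 == r0 * base (mod modulus)
    for i in range(exp_bits - 1, -1, -1):  # most significant bit first
        bit = (exponent >> i) & 1
        r0, r1 = cswap(bit, r0, r1)
        r1 = (r0 * r1) % modulus           # same multiply for either bit value
        r0 = (r0 * r0) % modulus           # same square for either bit value
        r0, r1 = cswap(bit, r0, r1)
    return r0
```

Fixing the loop length to `exp_bits` rather than to the exponent's actual bit length keeps the iteration count independent of the secret exponent.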

Chinese Remainder Theorem (CRT)

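CRT optimization splits an RSA private-key operation on a 2048-bit modulus into two exponentiations modulo its 1024-bit prime factors. Each half-size exponentiation costs roughly one eighth of the full one, so the two together give a speedup of up to about 4x for decryption.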
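A minimal sketch of RSA decryption with CRT, using Garner's recombination, is shown below. The function name and the toy key (p = 61, q = 53) are illustrative only. Python's built-in `pow` stands in for the library's modular exponentiation, and a real implementation would precompute `dp`, `dq`, and `q_inv` once per key.

```python
def rsa_crt_decrypt(c, p, q, d):
    """Recover m = c**d mod (p*q) using two half-size exponentiations."""
    dp = d % (p - 1)            # exponent reduced modulo p - 1
    dq = d % (q - 1)            # exponent reduced modulo q - 1
    q_inv = pow(q, -1, p)       # q^-1 mod p (Python 3.8+)
    m1 = pow(c, dp, p)          # half-size exponentiation modulo p
    m2 = pow(c, dq, q)          # half-size exponentiation modulo q
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q           # Garner's formula: result lies in [0, p*q)


# Toy parameters, far too small for real use.
p, q, e = 61, 53, 17
d = pow(e, -1, (p - 1) * (q - 1))
ciphertext = pow(65, e, p * q)
plaintext = rsa_crt_decrypt(ciphertext, p, q, d)
```

Here `plaintext` equals the original message 65 and matches the direct computation `pow(ciphertext, d, p * q)`.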

🤖 AI Optimizations

Early Exit Multiplication

For neural network inference, we often only need the top bits of precision. NEUROVEDIK's MSB-first computation allows early termination.

Approach	Behavior	Compute
Traditional (LSB-first)	Computes all 2048 bits, discards lower bits	100%
NEUROVEDIK (MSB-first)	Computes only needed precision, exits early	40-60%
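One way to picture MSB-first early exit is to evaluate the product's columns from the most significant end and stop after a fixed number of them. The sketch below is an assumption about how such a scheme could work, not the library's algorithm; `msb_first_product`, `to_limbs`, and `keep_columns` are hypothetical names.

```python
def to_limbs(n, base=2**64):
    """Split a non-negative integer into little-endian limbs."""
    limbs = []
    while True:
        limbs.append(n % base)
        n //= base
        if n == 0:
            return limbs


def msb_first_product(a, b, keep_columns, base=2**64):
    """Sum only the top keep_columns columns of a*b (little-endian limb lists).

    Returns (approximation, cutoff), where cutoff is the first skipped-above column
    index. The approximation never exceeds the exact product, and the skipped
    columns contribute less than min(len(a), len(b)) * base**(cutoff + 2).
    """
    n_cols = len(a) + len(b) - 1
    cutoff = max(0, n_cols - keep_columns)
    approx = 0
    for k in range(n_cols - 1, cutoff - 1, -1):  # most significant column first
        lo = max(0, k - len(b) + 1)
        hi = min(k, len(a) - 1)
        col = sum(a[i] * b[k - i] for i in range(lo, hi + 1))
        approx += col * base**k
    return approx, cutoff                        # columns below cutoff are never computed
```

With `keep_columns` equal to the total column count the result is exact. Smaller values trade low-order accuracy for fewer limb multiplications, and because the error bound spans two columns above the cutoff, a caller would keep a couple of guard columns beyond the precision actually needed.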

Approximate Multiply

When exactness isn't required (e.g., weight updates), approximate multiplication provides significant speedups while maintaining statistical accuracy.

💾 Memory Model

64-Byte Aligned BigInt

All BigInt allocations are 64-byte aligned to match cache line size and AVX-512 requirements. This eliminates cache line splits and enables aligned SIMD loads.

Offset	Field	Size
0x00	length (limbs)	8 bytes
0x08	capacity	8 bytes
0x10	limbs[0..5]	48 bytes
0x40	limbs[6..13]	64 bytes (next cache line)
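The layout above can be checked with Python's `ctypes`. The structure and helper names below (`BigIntLayout`, `limb_offset`, `cache_line_of`, `alloc_aligned_bigint`) are hypothetical, and the 14-limb array is simply the extent shown in the layout. The over-allocate-and-offset trick is one common way to obtain 64-byte alignment and is not necessarily how NEUROVEDIK allocates.

```python
import ctypes

CACHE_LINE = 64


class BigIntLayout(ctypes.Structure):
    """Hypothetical mirror of the documented header plus 14 limbs."""
    _fields_ = [
        ("length", ctypes.c_uint64),      # offset 0x00
        ("capacity", ctypes.c_uint64),    # offset 0x08
        ("limbs", ctypes.c_uint64 * 14),  # offset 0x10
    ]


def limb_offset(index):
    """Byte offset of limbs[index] from the start of the structure."""
    return BigIntLayout.limbs.offset + index * ctypes.sizeof(ctypes.c_uint64)


def cache_line_of(index):
    """Cache line (0-based) holding limbs[index], given a 64-byte aligned base."""
    return limb_offset(index) // CACHE_LINE


def alloc_aligned_bigint():
    """Allocate a BigIntLayout whose address is a multiple of 64 bytes."""
    size = ctypes.sizeof(BigIntLayout)
    raw = ctypes.create_string_buffer(size + CACHE_LINE - 1)  # room to slide forward
    shift = (-ctypes.addressof(raw)) % CACHE_LINE
    return BigIntLayout.from_buffer(raw, shift)               # keeps raw alive
```

In this model limbs 0 to 5 share the first cache line with the header, and limb 6 starts the second line at offset 0x40.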
                            

🛡️ Security Considerations

Constant-Time Operations

All cryptographic operations are implemented to run in constant time, preventing timing attacks.

✓ Constant-time comparison (no branch on secret bits)

✓ Montgomery ladder (same operations every iteration)

✓ Memory zeroing option (NV_SECURE_MEMORY)

✓ Miller-Rabin with 40 rounds (2⁻⁸⁰ error probability)

⚠ Not yet audited by third-party security researchers

⚠️ Production Warning: While NEUROVEDIK implements best practices for constant-time operations, it has not been formally audited. For high-security applications, supplement with audited cryptographic libraries.
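The idea behind a comparison with no branch on secret bits is to accumulate differences across every limb and test the result only once at the end. The sketch below illustrates that pattern; `ct_limbs_equal` and `ct_bytes_equal` are hypothetical names, and Python itself gives no timing guarantees for integer arithmetic. For byte strings, the standard library's `hmac.compare_digest` is the supported primitive.

```python
import hmac


def ct_limbs_equal(a, b):
    """Compare two equal-length limb lists without exiting at the first mismatch."""
    if len(a) != len(b):
        # Limb counts are treated as public information in this sketch.
        raise ValueError("limb counts must match")
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y  # accumulate differences; no data-dependent branch in the loop
    return diff == 0


def ct_bytes_equal(a: bytes, b: bytes) -> bool:
    """Compare byte strings using the standard library's timing-resistant helper."""
    return hmac.compare_digest(a, b)
```

A naive `a == b` on lists may stop at the first differing limb, which is the timing signal this pattern is meant to avoid.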
