🌟 Overview

NEUROVEDIK is a high-performance arbitrary-precision arithmetic library that achieves a 2x speedup over GMP on cryptographic workloads by combining ancient Vedic mathematical algorithms with modern SIMD hardware.

Key Innovations

  • Vedic Urdhva Tiryakbhyam: Cross-multiplication naturally parallelizes
  • AVX-512 SIMD: Process 8 × 64-bit integers simultaneously
  • Smart Dispatch: Auto-select optimal algorithm per input size
  • Constant-Time: Side-channel resistant crypto operations

🧮 Vedic Algorithms

Urdhva Tiryakbhyam (Vertically and Crosswise)

The Urdhva Tiryakbhyam sutra is an ancient Vedic multiplication method that computes each column of the result independently. Unlike traditional schoolbook multiplication, which proceeds row by row, this method proceeds column by column.

Algorithm Visualization

     A = [a₃ a₂ a₁ a₀]
   × B = [b₃ b₂ b₁ b₀]
   ─────────────────────

   Column 0: a₀×b₀
   Column 1: a₀×b₁ + a₁×b₀
   Column 2: a₀×b₂ + a₁×b₁ + a₂×b₀
   Column 3: a₀×b₃ + a₁×b₂ + a₂×b₁ + a₃×b₀
   Column 4: a₁×b₃ + a₂×b₂ + a₃×b₁
   Column 5: a₂×b₃ + a₃×b₂
   Column 6: a₃×b₃
                            

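Why this matters: The column sums do not depend on one another, so they can all be computed in parallel; only the final carry propagation between columns is sequential. This maps well to SIMD architectures, where the partial products of a column can be computed in a single vector instruction.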
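The column-wise scheme above can be sketched in plain Python. This is a minimal illustration, not NEUROVEDIK's implementation: the helper names (`to_limbs`, `from_limbs`, `urdhva_multiply`) and the 32-bit limb size are assumptions chosen for readability, and no SIMD is involved.

```python
def to_limbs(n, base_bits=32):
    """Split a non-negative int into little-endian limbs (hypothetical helper)."""
    mask = (1 << base_bits) - 1
    limbs = []
    while n:
        limbs.append(n & mask)
        n >>= base_bits
    return limbs or [0]


def from_limbs(limbs, base_bits=32):
    """Rebuild an int from little-endian limbs (hypothetical helper)."""
    return sum(limb << (i * base_bits) for i, limb in enumerate(limbs))


def urdhva_multiply(a, b, base_bits=32):
    """Multiply two little-endian limb lists column by column."""
    n, m = len(a), len(b)
    # Phase 1: column k sums every a[i] * b[j] with i + j == k.
    # Each column reads only the inputs, so the columns are independent.
    columns = [
        sum(a[i] * b[k - i] for i in range(max(0, k - m + 1), min(k, n - 1) + 1))
        for k in range(n + m - 1)
    ]
    # Phase 2: carry propagation, the only sequential step.
    mask = (1 << base_bits) - 1
    result, carry = [], 0
    for column in columns:
        total = column + carry
        result.append(total & mask)
        carry = total >> base_bits
    while carry:
        result.append(carry & mask)
        carry >>= base_bits
    return result
```

For two 4-limb inputs, `columns` holds exactly the seven sums listed in the visualization. A round trip such as `from_limbs(urdhva_multiply(to_limbs(x), to_limbs(y)))` should equal `x * y`.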

Nikhilam Sutra (Division)

The Nikhilam sutra provides fast division when the divisor is slightly below a power of 10 (or a power of 2 in binary). It converts division into additions and small multiplications by the divisor's complement, the gap between the divisor and that power.

Example
Divide 10004 by 98:
  98 is 100 - 2, so complement = 2

  10004 → Split: 100 | 04

  100 × 2 = 200
  04 + 200 = 204 → Split: 2 | 04

  2 × 2 = 4
  04 + 4 = 08

  Result: 100 + 2 = 102 remainder 08
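The worked example generalizes to any divisor just below a power of the base. The following is a minimal sketch under that assumption; `nikhilam_divmod` is a hypothetical name, and the `divmod` by a power of the base stands in for the digit split (a shift and mask when the base is 2).

```python
def nikhilam_divmod(n, d, base=10):
    """Divide n by d using the complement of d to the next power of base."""
    if n < 0 or d <= 0:
        raise ValueError("need n >= 0 and d > 0")
    power = 1
    while power < d:
        power *= base
    complement = power - d          # e.g. 100 - 98 = 2
    quotient = 0
    while n >= power:
        # Split n into high and low parts at the power boundary.
        high, low = divmod(n, power)
        # high * power == high * d + high * complement, so `high` joins the
        # quotient and high * complement is folded back into the remainder.
        quotient += high
        n = high * complement + low
    # Final correction: the remainder may still be >= d.
    while n >= d:
        quotient += 1
        n -= d
    return quotient, n
```

With the inputs from the example, `nikhilam_divmod(10004, 98)` passes through the intermediate value 204 and returns `(102, 8)`. The loop converges quickly only when the complement is small relative to the divisor, which is the case the sutra targets.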

⚡ SIMD Acceleration

AVX-512 Implementation

NEUROVEDIK uses AVX-512 instructions to process 8 × 64-bit limbs simultaneously. The 512-bit ZMM registers and specialized instructions provide massive throughput.

SIMD Level    Register Size    64-bit Limbs    Speedup
Scalar        64 bits          1               1x (baseline)
AVX2          256 bits         4               ~2.5x
AVX-512       512 bits         8               ~4x

Key Instructions Used

  • VPMULUDQ - Multiply packed unsigned doublewords (32 × 32 → 64-bit product per lane)
  • VPADDQ - Add packed quadwords
  • VPERMQ - Permute quadwords for cross-multiplication
  • VMOVDQA64 - Aligned 64-byte load/store
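VPMULUDQ multiplies only the low 32 bits of each 64-bit lane, so a full 64-bit limb product has to be assembled from four 32-bit partial products. The sketch below emulates that lane-wise behavior with Python lists. It illustrates the arithmetic only; `vpmuludq` and `mul64_lanes` are hypothetical names, and how NEUROVEDIK actually arranges its lanes is not stated in this document.

```python
MASK32 = (1 << 32) - 1


def vpmuludq(x, y):
    """Emulate VPMULUDQ: per lane, multiply the low 32 bits of x and y."""
    return [(a & MASK32) * (b & MASK32) for a, b in zip(x, y)]


def mul64_lanes(x, y):
    """Full 128-bit products of 64-bit lanes from four 32 x 32 multiplies."""
    x_high = [a >> 32 for a in x]
    y_high = [b >> 32 for b in y]
    low_low = vpmuludq(x, y)
    low_high = vpmuludq(x, y_high)
    high_low = vpmuludq(x_high, y)
    high_high = vpmuludq(x_high, y_high)
    # Python ints absorb the carries here; real SIMD code must propagate
    # them explicitly because each lane is only 64 bits wide.
    return [
        ll + ((lh + hl) << 32) + (hh << 64)
        for ll, lh, hl, hh in zip(low_low, low_high, high_low, high_high)
    ]
```

Passing two lists of eight 64-bit values mirrors one ZMM register per operand; each output element should equal the ordinary product of the corresponding inputs.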

🎯 Smart Dispatcher

The Smart Dispatcher automatically selects the optimal algorithm based on input size, available CPU features, and the specific operation being performed.

Input Size         Algorithm              Complexity
< 64 bits          Native CPU MUL         O(1)
64 - 2048 bits     Vedic + AVX-512        O(N²) SIMD
2048 - 32K bits    Karatsuba Optimized    O(N^1.58)
> 32K bits         NTT (Goldilocks)       O(N log N)

💡 Vedic Sweet Spot: The 64-2048 bit range covers 95%+ of cryptographic operations (RSA, ECC, signatures), making this the most impactful optimization.
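The size thresholds above can be expressed as a small selection function. This is a sketch only: `select_algorithm`, the returned labels, and the scalar fallback are hypothetical, "32K" is assumed to mean 32,768 bits, and the shared boundary at 2048 bits is assigned to the Vedic path because the ranges as listed overlap there.

```python
def select_algorithm(bits, has_avx512=True):
    """Pick a multiplication strategy from the operand size in bits."""
    if bits < 64:
        return "native_mul"            # single CPU multiply
    if bits <= 2048:
        # Assumed fallback when AVX-512 is unavailable.
        return "vedic_avx512" if has_avx512 else "vedic_scalar"
    if bits <= 32 * 1024:
        return "karatsuba"
    return "ntt_goldilocks"
```

For example, a 256-bit ECC operand selects the Vedic path, while a 4096-bit operand falls through to Karatsuba.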

🔐 Cryptographic Operations

Montgomery Multiplication

Montgomery form allows efficient modular multiplication without expensive division. Numbers are transformed to/from Montgomery space.

Transform to Montgomery: a' = a × R mod N

Montgomery Multiply: (a' × b' × R⁻¹) mod N

Transform back: result = a' × R⁻¹ mod N
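The three transformations can be sketched with the standard Montgomery reduction (REDC), assuming N is odd and R is a power of two larger than N so that dividing by R is a shift. The function names are hypothetical, `pow(n, -1, r)` requires Python 3.8 or later, and the final conditional subtraction here is an ordinary branch, so this sketch is not constant-time.

```python
def montgomery_params(n, r_bits):
    """Precompute constants for an odd modulus n with R = 2**r_bits > n."""
    r = 1 << r_bits
    if n % 2 == 0 or r <= n:
        raise ValueError("need odd n and R = 2**r_bits > n")
    n_prime = (-pow(n, -1, r)) % r     # n * n_prime == -1 (mod R)
    r_squared = (r * r) % n            # used to enter Montgomery form
    return n_prime, r_squared


def redc(t, n, r_bits, n_prime):
    """Return t * R^-1 mod n for 0 <= t < n * R, using only shifts and masks."""
    mask = (1 << r_bits) - 1
    m = ((t & mask) * n_prime) & mask
    u = (t + m * n) >> r_bits          # exact: t + m * n is divisible by R
    return u - n if u >= n else u


def to_montgomery(a, n, r_bits, n_prime, r_squared):
    return redc((a % n) * r_squared, n, r_bits, n_prime)    # a * R mod n


def montgomery_multiply(a_m, b_m, n, r_bits, n_prime):
    return redc(a_m * b_m, n, r_bits, n_prime)              # a' * b' * R^-1 mod n


def from_montgomery(a_m, n, r_bits, n_prime):
    return redc(a_m, n, r_bits, n_prime)                    # a' * R^-1 mod n
```

Converting two values in, multiplying them with `montgomery_multiply`, and converting the result back should give the same answer as `(a * b) % n`. The conversions pay off when many multiplications share one modulus, as in modular exponentiation.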

Modular Exponentiation

NEUROVEDIK uses the Montgomery Ladder for constant-time modular exponentiation, critical for RSA and Diffie-Hellman.
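The structure of a Montgomery ladder can be sketched as follows. Every iteration performs one multiplication and one squaring regardless of the exponent bit, and the bit only controls a masked swap. The name `ladder_pow` and the fixed `bits` parameter are assumptions for illustration, and Python integers are not constant-time, so this shows the control flow rather than a side-channel-safe implementation.

```python
def ladder_pow(base, exponent, modulus, bits):
    """Compute base**exponent mod modulus with a fixed number of iterations."""

    def conditional_swap(x, y, bit):
        # Branch-free swap: mask is 0 when bit == 0 and all ones when bit == 1.
        mask = -bit
        diff = (x ^ y) & mask
        return x ^ diff, y ^ diff

    r0, r1 = 1 % modulus, base % modulus      # invariant: r1 == r0 * base (mod modulus)
    for i in reversed(range(bits)):
        bit = (exponent >> i) & 1
        r0, r1 = conditional_swap(r0, r1, bit)
        r1 = (r0 * r1) % modulus              # same multiply every iteration
        r0 = (r0 * r0) % modulus              # same squaring every iteration
        r0, r1 = conditional_swap(r0, r1, bit)
    return r0
```

`bits` must be at least the bit length of the exponent. Fixing it to the key size rather than to the exponent's actual length keeps the iteration count independent of the secret.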

Chinese Remainder Theorem (CRT)

CRT optimization splits each RSA private-key operation on a 2048-bit modulus into two exponentiations modulo its 1024-bit prime factors, providing up to a 4x speedup for decryption.
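The speedup follows from the arithmetic: each half-size exponentiation uses an exponent half as long and, with quadratic multiplication, multiplications about a quarter as expensive, so each costs roughly one eighth and the pair roughly one quarter. A minimal sketch of CRT decryption using Garner's recombination is shown below. `rsa_crt_decrypt` is a hypothetical name, the toy primes are for illustration only, and a real implementation would precompute `dp`, `dq`, and `q_inv` once per key.

```python
def rsa_crt_decrypt(c, p, q, d):
    """Recover m = c**d mod (p * q) from two half-size exponentiations."""
    dp = d % (p - 1)
    dq = d % (q - 1)
    q_inv = pow(q, -1, p)               # requires Python 3.8+
    m1 = pow(c, dp, p)                  # exponentiation modulo p
    m2 = pow(c, dq, q)                  # exponentiation modulo q
    h = (q_inv * (m1 - m2)) % p         # Garner's recombination step
    return m2 + h * q


# Toy key: p = 61, q = 53, n = 3233, e = 17, d = 2753.
ciphertext = pow(65, 17, 3233)
plaintext = rsa_crt_decrypt(ciphertext, 61, 53, 2753)
```

Here `plaintext` should equal the original message 65, matching the direct computation `pow(ciphertext, 2753, 3233)`.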

🤖 AI Optimizations

Early Exit Multiplication

For neural network inference, we often only need the top bits of precision. NEUROVEDIK's MSB-first computation allows early termination.

  • Traditional (LSB-first): Computes all 2048 bits, discards lower bits (100% compute)
  • NEUROVEDIK (MSB-first): Computes only needed precision, exits early (40-60% compute)

Approximate Multiply

When exactness isn't required (e.g., weight updates), approximate multiplication provides significant speedups while maintaining statistical accuracy.
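One simple way to trade exactness for work is to keep only the leading bits of each operand before multiplying. The sketch below uses that operand-truncation approach purely as an illustration; it is not described in this document as NEUROVEDIK's algorithm, and `approx_multiply` and `keep_bits` are hypothetical names.

```python
def approx_multiply(a, b, keep_bits=32):
    """Approximate a * b for non-negative ints using the top keep_bits of each."""

    def truncate(x):
        # Drop everything below the top keep_bits bits of x.
        shift = max(x.bit_length() - keep_bits, 0)
        return x >> shift, shift

    a_top, a_shift = truncate(a)
    b_top, b_shift = truncate(b)
    # One small multiply, then restore the magnitude with a shift.
    return (a_top * b_top) << (a_shift + b_shift)
```

Because each operand loses less than a fraction 2^-(keep_bits - 1) of its value, the result never exceeds the exact product and its relative error stays below 2^-(keep_bits - 2). Operands that already fit in `keep_bits` bits are multiplied exactly.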

💾 Memory Model

64-Byte Aligned BigInt

All BigInt allocations are 64-byte aligned to match cache line size and AVX-512 requirements. This eliminates cache line splits and enables aligned SIMD loads.

Offset    Field             Size
0x00      length (limbs)    8 bytes
0x08      capacity          8 bytes
0x10      limbs[0..5]       48 bytes
0x40      limbs[6..13]      64 bytes (next cache line)
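The offsets in the table can be checked with a `ctypes` structure. This is a sketch of the layout only: `BigIntLayout` and `allocate_aligned` are hypothetical names, the 14-limb capacity simply matches the two cache lines shown, and the over-allocate-and-offset trick is one common way to obtain 64-byte alignment, not necessarily the library's allocator.

```python
import ctypes


class BigIntLayout(ctypes.Structure):
    """Mirror of the documented layout: two 8-byte fields, then the limbs."""
    _fields_ = [
        ("length", ctypes.c_uint64),         # offset 0x00
        ("capacity", ctypes.c_uint64),       # offset 0x08
        ("limbs", ctypes.c_uint64 * 14),     # offset 0x10; limbs[6] lands at 0x40
    ]


def allocate_aligned(alignment=64):
    """Place a BigIntLayout at an address that is a multiple of `alignment`."""
    raw = ctypes.create_string_buffer(ctypes.sizeof(BigIntLayout) + alignment)
    offset = (-ctypes.addressof(raw)) % alignment
    obj = BigIntLayout.from_buffer(raw, offset)
    return obj, raw                          # keep `raw` alive alongside `obj`
```

With this layout the header and the first six limbs share one cache line, and limbs 6 through 13 fill the next one exactly.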

🛡️ Security Considerations

Constant-Time Operations

All cryptographic operations are implemented to run in constant time, preventing timing attacks.

✓ Constant-time comparison (no branch on secret bits)
✓ Montgomery ladder (same operations every iteration)
✓ Memory zeroing option (NV_SECURE_MEMORY)
✓ Miller-Rabin with 40 rounds (2⁻⁸⁰ error probability)
⚠ Not yet audited by third-party security researchers
⚠️ Production Warning: While NEUROVEDIK implements best practices for constant-time operations, it has not been formally audited. For high-security applications, supplement with audited cryptographic libraries.
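The 2⁻⁸⁰ figure in the checklist follows from the standard Miller-Rabin bound: a composite number passes one round with a random base with probability at most 1/4, so 40 independent rounds give at most 4⁻⁴⁰ = 2⁻⁸⁰. A minimal sketch of the test is shown below; `is_probable_prime` is a hypothetical name, and the small-prime pre-check is a common convenience rather than something this document describes.

```python
import secrets


def is_probable_prime(n, rounds=40):
    """Miller-Rabin primality test with random bases."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = 2 + secrets.randbelow(n - 3)     # random base in [2, n - 2]
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False                     # a is a witness: n is composite
    return True
```

A `True` result means "prime with overwhelming probability", while `False` is always definitive.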