Performance Benchmarks

This guide explains how to measure MiniTensor performance, compare it with other frameworks, and avoid common benchmarking mistakes.

Benchmark commands

Run the bundled Python benchmark from the repository root:

python examples/performance_benchmark.py

You can also use the Makefile target:

make benchmark

The benchmark script attempts to import optional comparison frameworks such as PyTorch and TensorFlow. Missing optional frameworks are skipped rather than failing the MiniTensor benchmark.

Interpreting results

Performance numbers are only meaningful when the workload matches your use case. Small tensors can be dominated by Python call overhead and allocation costs, while large tensors are more likely to show the Rust engine, SIMD, memory layout, and backend behavior.

When comparing with another library, verify that both implementations use the same:

  • dtype and shape;

  • device/backend;

  • operation semantics;

  • thread count or backend scheduling policy;

  • warmup and synchronization points;

  • data-transfer policy between host and accelerator memory.

Optimization checklist

  • Prefer vectorized tensor operations over Python loops.

  • Keep tensors contiguous before expensive operations when possible.

  • Reuse tensors and avoid unnecessary conversions to and from NumPy.

  • Use GPU backends for workloads large enough to amortize transfer and launch overhead.

  • Batch many small operations into fewer larger operations when practical.

  • Run release builds for performance measurements; debug builds are for correctness debugging, not speed.

Profiling pointers

Start with the highest-level benchmark that reproduces the slowdown. Then narrow it down to a specific operation, input shape, dtype, and backend. For Rust-side work, combine targeted Rust tests or examples with standard profilers available on your platform. For Python-side work, compare the cost of tensor creation, operation execution, NumPy conversion, and training-loop overhead separately.