
Show HN: RunMat – runtime with auto CPU/GPU routing for dense math
9 points by nallana | 2 comments on Hacker News.
Hi, I’m Nabeel. In August I released RunMat as an open-source runtime for MATLAB code that was already much faster than GNU Octave on the workloads I tried: https://ift.tt/ACh7Hqt

Since then, I’ve taken it further with RunMat Accelerate: the runtime now automatically fuses operations and routes work between CPU and GPU. You write MATLAB-style code, and RunMat runs your computation across CPUs and GPUs for speed. No CUDA, no kernel code.

Under the hood, it builds a graph of your array math, fuses long chains into a few kernels, keeps data on the GPU when that helps, and falls back to the CPU JIT / BLAS for small cases. (Two illustrative sketches of what this looks like follow at the end of this post.)

On an Apple M2 Max (32 GB), here are some current benchmarks (median of several runs):

* 5M-path Monte Carlo
  * RunMat ≈ 0.61 s
  * PyTorch ≈ 1.70 s
  * NumPy ≈ 79.9 s
  → ~2.8× faster than PyTorch and ~130× faster than NumPy on this test.
* 64 × 4K image preprocessing pipeline (mean/std, normalize, gain/bias, gamma, MSE)
  * RunMat ≈ 0.68 s
  * PyTorch ≈ 1.20 s
  * NumPy ≈ 7.0 s
  → ~1.8× faster than PyTorch and ~10× faster than NumPy.
* 1B-point elementwise chain (sin / exp / cos / tanh mix)
  * RunMat ≈ 0.14 s
  * PyTorch ≈ 20.8 s
  * NumPy ≈ 11.9 s
  → ~140× faster than PyTorch and ~80× faster than NumPy.

If you want more detail on how the fusion and CPU/GPU routing work, I wrote up a longer post here: https://ift.tt/2QCUVMt

You can run the same benchmarks yourself from the GitHub repo in the main HN link. Feedback, bug reports, and “here’s where it breaks or is slow” examples are very welcome.
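To make the fusion idea concrete, here is a minimal MATLAB-style sketch of the kind of elementwise chain the graph builder targets. This is an illustrative example, not a script from the repo, and the size is scaled down from the 1B-point benchmark:

    % Illustrative elementwise chain (not the repo's benchmark script).
    % In a naive runtime each line below is a separate pass over memory;
    % a fusing runtime can compile the whole chain into one kernel and
    % keep x resident on the GPU until the final reduction.
    n = 1e8;                 % the benchmark above uses 1e9 points
    x = rand(n, 1);
    y = sin(x);
    y = exp(0.5 * y);
    y = cos(y) + tanh(x);    % sin / exp / cos / tanh mix, as above
    s = sum(y);              % reduction produces a single scalar

For a small n, the same chain would instead fall back to the CPU JIT / BLAS path, where dispatching a GPU kernel isn’t worth it.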
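And here is a hedged sketch of what the Monte Carlo row can look like as dense array math. The model (terminal-value geometric Brownian motion pricing a European call) and all parameters below are illustrative assumptions, not the repo’s actual benchmark script:

    % Hypothetical 5M-path Monte Carlo (GBM call price). Dense
    % randn / exp / max / mean operations like these are what the
    % automatic CPU/GPU routing is meant to accelerate.
    npaths = 5e6;
    s0 = 100; k = 100; r = 0.05; sigma = 0.2; T = 1;
    z  = randn(npaths, 1);                    % one shock per path
    st = s0 * exp((r - 0.5*sigma^2)*T + sigma*sqrt(T)*z);
    payoff = max(st - k, 0);                  % call payoff at expiry
    price  = exp(-r*T) * mean(payoff);        % discounted average

Everything from randn to the final mean is elementwise work over 5M elements plus one reduction, which is exactly the shape of workload that benefits from fusion and from keeping the data on the GPU.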

