www.semianalysis.com/p/nvidiaopenaitritonpytorch
Top Highlights
However, with the arrival of PyTorch 2.0 and OpenAI's Triton, Nvidia's dominant position in this field, mainly due to its software moat, is being disrupted.
Eager mode can be thought of as a standard scripting execution method. The deep learning framework executes each operation immediately, as it is called, line by line, like any other piece of Python code.
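A minimal sketch of what eager execution looks like in PyTorch (the tensor shapes below are arbitrary, chosen only for illustration):

```python
import torch

# Eager mode: each operation runs the moment it is called, so every
# intermediate result is a real tensor that can be inspected immediately.
x = torch.randn(4, 8)
w = torch.randn(8, 16)

h = x @ w            # the matrix multiply executes on this line
print(h.shape)       # torch.Size([4, 16])

y = torch.relu(h)    # the ReLU executes on this line
print(y.min() >= 0)  # tensor(True)
```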
Graph mode has two phases. The first phase is the definition of a computation graph representing the operations to perform. A computation graph is a series of interconnected nodes representing operations or variables, and the edges between nodes represent the data flow between them. The second phase is the deferred execution of an optimized version of the computation graph.
This is analogous to "interpreted" vs. "compiled" languages, like Python vs. C++.
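To make the two phases concrete, here is a minimal sketch using torch.fx, one of several graph-capture mechanisms in PyTorch (the layer function below is illustrative): the graph is first defined as connected operation nodes, and only later executed.

```python
import torch
import torch.fx as fx

def layer(x, w):
    h = x @ w
    return torch.relu(h)

# Phase 1: capture a computation graph. No real tensors flow here; the
# result is a graph of nodes (matmul, relu, ...) connected by data edges.
traced = fx.symbolic_trace(layer)
print(traced.graph)

# Phase 2: deferred execution. The (possibly optimized/rewritten) graph is
# run later, once actual inputs exist.
out = traced(torch.randn(4, 8), torch.randn(8, 16))
print(out.shape)     # torch.Size([4, 16])
```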
Compute (FLOPS): Running dense matrix multiplication within each layer
Memory (Bandwidth): Waiting for data or layer weights to get to the compute resources. Common examples of bandwidth-constrained operations are various normalizations, pointwise operations, SoftMax, and ReLU.
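A rough back-of-the-envelope comparison (the tensor sizes and 2-bytes-per-element figure are illustrative assumptions, not from the article) of why a large matrix multiply is compute-bound while a pointwise op like ReLU is bandwidth-bound:

```python
# Arithmetic intensity = FLOPs performed per byte moved to/from memory.
bytes_per_elem = 2                      # assuming FP16/BF16 activations
M = K = N = 4096

# Dense matmul: 2*M*K*N FLOPs, but only ~(M*K + K*N + M*N) elements touched.
matmul_flops = 2 * M * K * N
matmul_bytes = (M * K + K * N + M * N) * bytes_per_elem
print("matmul FLOPs/byte:", matmul_flops / matmul_bytes)   # ~1365 -> compute-bound

# Pointwise ReLU: 1 FLOP per element, yet every element is read and written.
relu_flops = M * N
relu_bytes = 2 * M * N * bytes_per_elem
print("ReLU FLOPs/byte:", relu_flops / relu_bytes)          # 0.25 -> bandwidth-bound
```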
The obvious question is why don’t architects put more memory closer to the compute. The answer is $$$.
Even with heavy optimizations from leading researchers, 60% FLOPS utilization is considered a very high utilization rate for large language model training.
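For context, FLOPS utilization is simply achieved throughput divided by the hardware peak; the peak figure below is an assumption for illustration (roughly an A100-class GPU in BF16), not a number from the article.

```python
# Model FLOPS utilization = sustained training throughput / hardware peak.
peak_tflops = 312.0        # assumed peak (A100-class, BF16 tensor cores)
achieved_tflops = 187.0    # hypothetical sustained throughput during training

print(f"utilization: {achieved_tflops / peak_tflops:.0%}")   # ~60%
```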
As such, one of the principal optimization methods for a model executed in Eager mode is called operator fusion.
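The payoff of fusion is reduced memory traffic rather than fewer FLOPs. A back-of-the-envelope sketch (the tensor size and FP16 assumption are illustrative) for fusing a chain of three pointwise ops into one kernel:

```python
# Memory traffic for three chained pointwise ops (e.g. scale -> ReLU -> add)
# over an N-element FP16 tensor.
N = 4096 * 4096
bytes_per_elem = 2

# Unfused: each op is its own kernel, reading its input from DRAM and
# writing its output back to DRAM.
unfused_bytes = 3 * (N + N) * bytes_per_elem

# Fused: one kernel reads the input once, keeps intermediates in
# registers/SRAM, and writes the final result once.
fused_bytes = (N + N) * bytes_per_elem

print(f"unfused: {unfused_bytes / 1e6:.0f} MB, fused: {fused_bytes / 1e6:.0f} MB")
# -> roughly 3x less data moved for the same arithmetic
```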
The primary difference with PyTorch 2.0 is that it adds a compiled solution that supports a graph execution model. This shift will make properly utilizing various hardware resources much easier.
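A minimal sketch of the compiled path in PyTorch 2.0, assuming a build where torch.compile is available (the model below is arbitrary): the familiar eager-style API stays the same, but the model is captured into a graph and handed to a compiler backend before execution.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 1024),
)

# torch.compile traces the model into a graph and lowers it through a
# compiler backend (TorchInductor by default, emitting Triton kernels on GPUs).
compiled = torch.compile(model)

x = torch.randn(8, 1024)
out = compiled(x)      # first call triggers compilation; later calls reuse it
print(out.shape)       # torch.Size([8, 1024])
```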
The default software stack for machine learning models will no longer be Nvidia's closed-source CUDA.
Back to why PyTorch won. While there was an element of wrestling control away from Google, it was primarily due to the increased flexibility and usability of PyTorch versus TensorFlow.
The Google generative AI models are based on Jax, not TensorFlow.
DRAM has an order of magnitude higher latency than SRAM (~100+ nanoseconds vs. ~10 nanoseconds), but it's also much cheaper ($1s per GB vs. $100s per GB).
Memory follows a hierarchy from close and fast to slow and cheap. The nearest shared memory pool is on the same chip and is generally made of SRAM. Some machine-learning ASICs attempt to utilize huge pools of SRAM to hold model weights, but there are issues with this approach. Even Cerebras’ ~$2,500,000 wafer scale chips only have 40GB of SRAM on the chip. There isn’t enough memory capacity to hold the weights of a 100B+ parameter model.
1GB of SRAM on TSMC’s 5nm process node would require ~200mm^2 of silicon.
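A quick sanity check combining those two figures, assuming 16-bit weights (the bytes-per-parameter value is an assumption for the sketch):

```python
# Could a 100B+ parameter model's weights live entirely in on-chip SRAM?
params = 100e9
bytes_per_param = 2                      # assuming FP16/BF16 weights
weights_gb = params * bytes_per_param / 1e9
print(f"weights alone: ~{weights_gb:.0f} GB (vs. ~40 GB of SRAM on a wafer-scale chip)")

# Silicon cost of that much SRAM at ~200 mm^2 per GB on a 5nm-class node:
mm2_per_gb = 200
print(f"silicon required: ~{weights_gb * mm2_per_gb:,.0f} mm^2")   # ~40,000 mm^2
```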
AMD and Tenstorrent are actively working to integrate deeply into the software stack.