Going Further with CUDA for Python Programmers

[Henry]

Introduction

So welcome, this is Going Further with CUDA for Python Programmers. As the name suggests, this won’t make too much sense unless you’ve already got started with CUDA for Python programmers. The good news is that I have a video called Getting Started with CUDA for Python Programmers, so start there. It’s only a bit over an hour long. You might be surprised at how quick and easy it is to get started if you haven’t. So assuming that you have got started, today we’re going to be looking at the most important next step in taking advantage of CUDA. We’ve already learnt to take advantage of the thousands of threads that you can run simultaneously on a GPU.

Shared Memory

Today we’re going to learn how to take advantage of the incredibly fast memory. So, up till now, although we haven’t really talked about it, the memory we’ve been using is what’s called global memory.

[00:01:03]

It’s basically, think of this, so this is the same book we looked at last week, which we do recommend, Programming Massively Parallel Processors. And the stuff we’re covering today is largely covered in chapter 5 of that book. In this CUDA MODE series, there’s a lecture from Thomas Viehmann which goes through this and the previous chapter in some detail. And so that’s actually a good video to watch, maybe after this one or before this one, either order is fine. They’re covering similar material in different ways. The key thing to understand, which is what we’re looking at here today, is that if we look at this box here, which you can basically think of as a GPU, in the GPU you have global memory. Global memory is basically what we’ve always been using so far: when we say .cuda in PyTorch, it’s actually putting the tensor into global memory.

[00:02:09]

Global memory is pretty fast, compared to some other types of memory we might be used to, but it’s far from the quickest memory available on the GPU. In fact, this shared memory is much faster. Shared memory, however, is not global, which is to say that not all of the threads can see it. In fact, as this box indicates, shared memory is something that’s only available to the threads on a specific streaming multiprocessor, SM, or in CUDA programming world, within a block. So all of the different threads in a block can access shared memory. And the reason we care about this is that shared memory is about 10 times faster than global memory.

[00:03:06]

And because CUDA, with all of its thousands of threads running simultaneously, is so incredibly quick, because GPUs are so incredibly quick, the speed at which you access memory turns out to matter a whole lot. So being able to use this shared memory effectively is just as important as being able to use the thousands of threads simultaneously. So in the last lecture, we focused on how to use all of those threads, and today we’ll focus on how to use shared memory. Those two things will get you quite a long way in terms of creating pretty fast CUDA code. Okay, so the repo for these lectures is the CUDA MODE repo, and specifically the cuda-mode/lectures repo, and in there you’ll find there’s a lecture 5.

[00:04:09]

You don’t have to have seen all the lectures beforehand, but they’re certainly all useful. You just need to have seen lecture 3, which is the one I mentioned. Lecture 5 is where today’s notebook will be found. And here it is. Okay, so one thing I’ll just mention that I’ve added is a little utils.py, where some of the stuff that we used last time, and we’ll use quite a bit, I’ve just put it all into a script so we can access it multiple times. So we’ve got cdiv, which is a ceiling division function, the little load_inline wrapper called load_cuda, and the little prefix we have that has the #includes and the other stuff we’re going to need there. And so you’ll see that we’re going to import those here.
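For reference, here’s a rough sketch of what such a utils module might contain, following the description above; the names cdiv, cuda_begin, and load_cuda, and the exact flags, are assumptions, and the real code lives in the cuda-mode/lectures repo:

```python
# utils.py -- a rough sketch of the helpers described in the lecture
from torch.utils.cpp_extension import load_inline

def cdiv(a, b):
    "Ceiling division: how many blocks of size b are needed to cover a."
    return (a + b - 1) // b

# Prefix pasted at the top of every CUDA source string (the #includes mentioned above)
cuda_begin = r'''
#include <torch/extension.h>
#include <stdio.h>
#include <c10/cuda/CUDAException.h>
'''

def load_cuda(cuda_src, cpp_src, funcs, opt=False, verbose=False, name=None):
    "Compile and load an inline CUDA extension; default the cache name to the first function's name."
    if name is None: name = funcs[0]
    return load_inline(name=name, cuda_sources=[cuda_src], cpp_sources=[cpp_src],
                       functions=funcs, verbose=verbose,
                       extra_cuda_cflags=['-O2'] if opt else [])
```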

[00:05:05]

Other than that, we’re going to import all the usual stuff that we like to use. Last time we used SimpleNamespace, but this time I thought let’s make things closer to how CUDA does things, and create a little thing called dim3. So this is a 3D grid with an x, a y, and a z, using the handy little Python named tuple functionality. And we can, just like in CUDA, provide as many of the dimensions as we want. Today we’ll be doing two-dimensional grids, so there will be an implicit z equal to 1. So we can access these as d.x and d.y, for example.
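As a minimal sketch of that helper (the defaults argument is an assumption about how the lecture’s dim3 is set up):

```python
from collections import namedtuple

# A CUDA-style dim3: unspecified trailing dimensions default to 1
dim3 = namedtuple('dim3', ['x', 'y', 'z'], defaults=(1, 1))

d = dim3(4, 3)        # x=4, y=3, z defaults to 1
print(d.x, d.y, d.z)  # 4 3 1
```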

[00:06:04]

Like before, we’ll use wurlitzer to print stuff from CUDA if we want to. CUDA_LAUNCH_BLOCKING is helpful for debugging, so you can turn that on if you want to. And so today we’re going to do a matrix multiplication of a 5120 by 256 matrix M1 by a 256 by 5120 matrix M2. So like before, we’re going to start by looking at pure Python. And pure Python is going to be really, really slow. So to handle that, we’re going to create a sample of the first matrix with just the first four rows, and a sample of the second matrix with just the first four columns. And we’ll use those for our pure Python example.

[00:07:01]

All right, so just to remind you, what we’ve already done in the past is we created this simple kernel runner that goes through every block and every thread. Not real blocks and threads, they’re actually just integers. And it calls some kernel, which is not actually a kernel, it’s just a function. And I’m going to use dim3 now, now that we’ve got it, to pass into that. And so this was our previous matrix multiplication. We grab the row, we grab the column from the indexes we passed in. We have a guard. And we then accumulate our dot product for whatever particular row and column we’re filling in. So ignore the extra details here, but conceptually we’re just doing this dot product to fill in this output element.
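Here’s a rough sketch of that Python “kernel runner” and naive matmul kernel, reconstructed from the description above; treat the names blk_kernel2d and matmul_bk as placeholders rather than the repo’s exact code:

```python
import torch
from collections import namedtuple

dim3 = namedtuple('dim3', ['x', 'y', 'z'], defaults=(1, 1))

def blk_kernel2d(f, blocks, threads, *args):
    "Pretend CUDA launcher: just nested loops over block and thread indices."
    for by in range(blocks.y):
        for bx in range(blocks.x):
            for ty in range(threads.y):
                for tx in range(threads.x):
                    f(dim3(bx, by), dim3(tx, ty), threads, *args)

def matmul_bk(blockidx, threadidx, blockdim, m, n, out, h, w, k):
    "One 'thread': compute one output element out[r, c] as a dot product."
    r = blockidx.y * blockdim.y + threadidx.y
    c = blockidx.x * blockdim.x + threadidx.x
    if r >= h or c >= w: return                        # guard for threads off the edge
    o = 0.
    for i in range(k): o += m[r*k + i] * n[i*w + c]    # flattened 1-D indexing
    out[r*w + c] = o
```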

[00:08:03]

So if we’re filling in this here, then this is R, and this is C. So this is R, C. And so we’re doing the dot product between that column and that row. And so that’s what this looping is here. So I is going through all of the elements of the row and the column. And multiplying, adding, and then putting that into the output. So that’s what we do. We have a so-called kernel. And then we created something that would call the kernel, by calling our kernel runner. Passing in the function. We need our blocks and our threads per block, which are just dim3 tuples. And then we pass in our flattened data and any other information that’s required.

[00:09:06]

And so we can check that that matrix multiplication result, using these small sample versions, is close to the PyTorch version. And it is. And then we also looked at the CUDA version. And the CUDA version we created by pasting the kernel into ChatGPT. And it spat out something which we hardly had to change at all to get this. And the kernel runner also looks very similar, except that the syntax for calling a kernel in CUDA is different, with this weird triple angle bracket. To make life a little bit simpler for myself, you might remember before we had the cpp_source, where we would copy and paste the signature into a string.

[00:10:01]

I got a bit bored of doing that manually, so I created a little get_sig (get signature) function that just uses a regular expression to automatically find that line of code. And so, for the rest of this lesson, I will be getting this cpp_source automatically. That way I don’t have to worry about changing it. But you can see that the regex is just returning the necessary line of code, plus a semicolon. So that makes life a little bit simpler, and I like keeping things simple. OK, so then we go load_cuda. It ran very quickly, because I’ve already compiled this once, and PyTorch caches that. And this is actually another change I made since last time: in the load_cuda function, if you don’t pass in a name, then it uses the function’s name. And that means that PyTorch will cache different versions of your code under different names for the various different things you’re doing.
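One possible implementation of such a helper (the exact regex in the lecture’s utils may differ; this is just the idea of pulling the wrapper function’s signature line out of the CUDA source and appending a semicolon):

```python
import re

def get_sig(fname, src):
    "Find the C++ signature line for function `fname` in `src` and return it as a declaration."
    res = re.findall(rf'^(.+\s+{fname}\(.*?\))\s*{{?\s*$', src, re.MULTILINE)
    return res[0] + ';' if res else None

cuda_src = "torch::Tensor matmul(torch::Tensor m, torch::Tensor n) {"
print(get_sig('matmul', cuda_src))  # torch::Tensor matmul(torch::Tensor m, torch::Tensor n);
```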

[00:11:03]

So you won’t lose your cache each time, so that’s handy as well. OK, so now we can use the full matrices, because we’re going to be nice and fast. We need them to be contiguous in CUDA, so we’ll create m1c and m2c for that. And they should be similar to the result of PyTorch doing it, and it takes about 6 milliseconds. One thing I wondered about is how long of that 6 milliseconds was it actually running the matrix multiplication, compared to doing all this other stuff before it. So I just commented out those two lines of code and re-ran it, and that took about 50 microseconds. So that’s 0.05 milliseconds. So very little of this time is kind of overhead. Most of this is actually doing a matrix multiplication. So I think that’s an encouraging start.

[00:12:00]

Shared Memory in Python

OK, so how do we take advantage of shared memory? The problem here is that in our inner loop here, m and n are in global memory. And so in this loop that’s happening k times, we are reading from global memory again and again and again. And that is a bit of a bummer. So there’s a better thing we could do, which is instead we could use shared memory. Now the problem is that shared memory is quite small. So we can’t just dump everything into shared memory for every block, because we’ve got lots and lots of blocks running at the same time. If every one of them had the entire matrices in shared memory, that’s going to be an enormous amount of data, and that’s going to be far too much for our GPU to handle.

[00:13:05]

So to deal with that, what we do is something called tiling. We’re going to pick a tile width of 16, so wherever it says tile width here, we’re going to use 16. We basically say, OK, if we’re going to calculate this R, C element here, instead of doing the entire dot product of all of this row and all of this column, what we could instead do is just grab the first little bit of that row, and the first little bit of that column. We could take the dot product of those and put it into R, C, and then we could do that again for the next tile across, and the next tile across, and so forth.

[00:14:02]

And this is what this dot, dot, dot here is, and the next tile across. And so then, you know, eventually we get to this bit of the row by this bit of the column, take the dot product of those, and add them up to the existing R, C output we’ve already got. And so that’s just, it’s doing exactly the same thing, but rather than doing the dot product all at once, we’re doing it one step at a time. That’s not interesting of itself, but what is interesting is you might notice that, let’s say for calculating this bit here, let’s say, so this is thread zero comma zero, we can do the same thing. We can take the first little bit of this, and the first little bit of this, and take their dot product, and that gives us the first piece we’re going to need of that one.

[00:15:06]

We can do that again, and again, and again, until eventually we get to this one, and we do this bit times this bit, and we keep going all the way to the end until there’s the final tile at the end. And once we’ve done that for all of the bits, eventually we’re going to have the correct answer in zero comma zero. Why is that interesting? Well, it’s interesting because we could reorder this. Rather than doing the whole first little bit of this row, and then the next bit of that row, and the next bit of that row, and the next bit of that row, instead what we could do is we could calculate zero comma zero for the first tile, and then we could calculate zero comma one for the first tile. And notice this zero comma one, it’s exactly the same row as we had before, right, but a different column.

[00:16:03]

Now with normal CPU-style thinking, you’d say, oh, maybe this will be in the cache, so this could be faster. That doesn’t work in GPU programming. In GPU programming, we instead use shared memory. So we could have put this into shared memory, and if we had done so, then the second time we use it, we don’t have to read it from global memory, it’s already there in shared memory. And then the same thing will happen when we get to the second row, right? We could put that into shared memory, and then the second row of that tile times the first column of that tile is what’s needed to do element (1, 0). And if you think about it, we’ve already accessed the first column of the tile in N. So if we had put that in shared memory as well, then we won’t have to get that from global memory either. So maybe you see where this is going. What we’re going to be able to do, actually, before we do any work, is put this whole tile into shared memory, and put this whole tile into shared memory, and then take the matrix multiplication of the two tiles, and that will give us all of the first pieces of the entire tile output.

[00:17:21]

And then we’ll do the same for the tile one to the right of this one, and one underneath this one, and we’ll take the matrix product of those and add it to this again, and so forth until eventually, again, we get up to here, we put that whole tile into shared memory, we put that whole tile into shared memory, we take the matrix product, which again, remember, it’s just lots and lots of dot products, the column and row dot products. And so all of those are going to be able to use shared memory, and again, we add them to the outputs here. And so once we eventually do that for all of the tiles, we will have finished calculating these outputs.

[00:18:01]

So how many times did we read from global memory? Each of the input elements only got read from global memory once. And as soon as we grabbed it, all we did with it was we put it into shared memory, and then the actual dot product was entirely done from shared memory. And that’s how we make this faster. So to do that, let’s use Python, plain Python.

Dynamic Shared Memory

And we’re going to basically try to design something in Python that looks a lot like how CUDA is going to do it. And then we’re going to auto-generate CUDA, just like we have in the past. So, in CUDA, the kind of maximally flexible way to do things is what’s called dynamic shared memory, where you tell CUDA how much shared memory you’re going to want.

[00:19:05]

It sets that aside for you as basically one contiguous block, and gives you a pointer to that block, which is the same as an array. And then you basically grab from that block any of the pieces you want. In Python, we can do exactly the same kind of thing by using a trick which is true for both NumPy arrays and PyTorch tensors, which is that views into those tensors are writable. So if we create a tensor of length 5, and then we create views of the first three elements and of the last two elements, called b and c: if we then modify b, it actually changes a, because they’re views of the same memory. And if we change c, it’ll also change a. And so that’s going to be handy for us.
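Concretely, that trick looks like this:

```python
import torch

a = torch.zeros(5)
b, c = a[:3], a[3:]   # b and c are views into a's storage, not copies
b[0] = 1.
c[-1] = 9.
print(a)              # tensor([1., 0., 0., 0., 9.]) -- writing through the views changed a
```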

[00:20:00]

You’ll see why in a moment. We’re going to basically use our shared memory like that. Now, the thing is, we’ve got to restructure our kernel runner a little bit, because we have two steps now.

Shared Memory Kernel Runner

Step number one is copy all of our input into shared memory, and then step two is take the dot product. And so that doesn’t quite work with our previous approach, because we just have one big loop and we just have one thing that we do. So I’ve changed our kernel runner to create a shared memory kernel runner. I’ve still got the same loop through blocks.y, the same loop through blocks.x. This is all pure Python again. And here I’m going to create our shared memory. And so this is now going to be passed, the shared memory, into each function.

[00:21:00]

So all of our threads are going to have access to the same shared memory. Now we don’t actually create the threads here. So instead, step number one is that in my kernel, I’m actually going to do the loop through the threads manually. We’ll improve this in a moment, don’t worry. It’s pretty messy with all this kind of duplicate code, but at least it’s nice and simple to understand. So first of all, let’s just run this and confirm we get the same answer as before. And we do. So let’s see what’s happening. The bit that does the running is exactly the same, except that I’m calling our new shared memory runner. And the third thing you have to pass in is the shared size, which is how much shared memory we need. So how much shared memory do we need? We need tileWidth times tileWidth, because that’s the size of the tile: it’s tileWidth by tileWidth.

[00:22:04]

But we’re going to need two of them, one for m and one for n. So the amount of shared memory we need is tileWidth times tileWidth times 2. So that’s what this is, tileWidth times tileWidth times 2. So that’s going to be passed in as the shared memory size. And that will be constructed here, torch.zeros. So that shared then gets passed into our kernel, our pretend kernel. And it’s just one big contiguous block of memory. So we have to grab the first share size bits, and that will be our m shared memory. So our two inputs are m and n. And everything from there onwards is going to be our n shared memory. So then what we do is we loop through. This is exactly the same as we had before.

[00:23:01]

In fact, I should use cdiv here to make it a bit more obvious what’s going on. So we go through every element in the dot product we’re going to need. And so the indexing starts to get a bit complicated here. So ph is what the block, and therefore we, will use as basically the index of which tile we’re up to. So we loop through each tile. The number of tiles we’ll need is the size of the k dimension, that’s the number of columns in m or the number of rows in n, divided by the tile width. And that tells you how many tiles will fit. And we do a ceiling division to go all the way to the end.

[00:24:00]

So then we need to know, let’s say we’re doing this r, c one here again, right, we need to know where this is. Where does it start? And the answer is that we’ve done ph lots of tiles so far, and each one has jumped across tw, or tile width. So this distance here is ph times tile width, and we’re going to call that idx. So this is an important tip: I found I had a lot of trouble getting this settled in my head until I drew it all out and wrote on my picture what everything is. And I’ve been doing this with the help also of my friend Karim, who works with me at answer.ai. And he found the same thing.

[00:25:01]

We were both like, our first attempts were both to do it just in code, and we did not get it working until we actually started drawing it out. And that’s when Karim and I were like, oh, okay, that all makes sense. So that’s what idx is, right? And notice idx is that, but it’s also, because these are symmetric, it’s also that. That length is also idx. Okay, so now we need to fill in the shared memory. So we’ve got two sets of threads: one to fill in the shared memory, and one to do the dot products. Let’s write that in: fill shared, then do the dot products from shared. Okay, so we need to go through all of our threads and find out what row and column we’re in.

[00:26:01]

So how do we find out what row and column we’re in? And again, these are the things that get complicated. So this is R, as we’ve already mentioned. So R is going to be equal to, look, there’s two pieces of it. There’s the idx piece, which goes from here to here. And then there’s also an additional piece, which is from here to here. What is that piece? Well, that piece is simply the coordinates of this grid location within the tile. And remember that we are looping through blockdim.y and blockdim.x, which is the size of the tile. So we’ve got tileRow and tileColumn. So this here is tileRow, and this here is tileColumn.

[00:27:03]

And so therefore to find R, we have to add together idx plus tr. And here it is, idx plus tr. And that needs to be less than the second dimension of the matrix. And then here, we just need to index into it. So if this was a two-dimensional tensor, we could just do tr, tc. But it’s not, it’s one-dimensional, so we have to flatten out our index: it becomes tr times tw, plus tc. So this is filling in our m-shared and n-shared, by going through all the possible elements of the tile and filling them all in. Okay, so after this bunch of loops is complete, ms and ns will simply contain a copy of the appropriate tile from m and n.

[00:28:09]

And again here, the indexing we’re doing is, so this, remember, is the kind of element that we’re up to in terms of the column, and this is the row that we’re doing. But we have to multiply by k in order to deal with the fact that we’ve flattened out our indexes. One thing you might have been wondering is: what about the final tile that goes off the edge, so it’s not big enough? What happens there? For that final tile, we put in zeros. We call that padding. And they show that in the book here. So in this case, they’re doing a 4×4 matrix multiplication containing 2×2 grids.

[00:29:05]

And you can see here when we’re doing this one, sorry, it’s actually a 3×3 using a 2×2 grid. So when we get to this piece here, it goes off the edge. So what happens when we go off the edge? We just put zeros into the shared memory. And so that means that when we do the dot product between this one here containing zeros, and this one here containing zeros, the zeros can just be ignored. They don’t do anything because they’re just zeros. So that’s why we put zeros in if we are outside the dimensions of the matrix, for both m and n. So now that has filled in our shared memory, or our pretend shared memory. I mean, it is shared memory, it’s just not any faster, because we’re just in Python. And so now we’ve done that, we can go through all the threads.

[00:30:02]

Again, find out what row and column we’re in using exactly the same code. And then we can go through our tile width and aggregate all of the bits of our dot product. So why are we aggregating over the tile width? Because the dot product will always be between tile width elements on this side, and tile width elements on this side. So every row from here, and every column from here, will be of size tw. So that’s why we do that. Okay, so that’s that rather messy tiled matrix multiplication in Python. But this is the place to start. So if you don’t understand anything, come back to here, because you can run it through in the debugger. You can print out what the shared memory looks like.
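To make that concrete, here’s a compact sketch of the loop-based version being described, with hypothetical names (blk_kernel2d_shar, matmul_tiled_bk); the real notebook’s code differs in details but follows the same two-phase structure:

```python
import math, torch
from collections import namedtuple

dim3 = namedtuple('dim3', ['x', 'y', 'z'], defaults=(1, 1))

def blk_kernel2d_shar(f, blocks, tpb, shar_sz, *args):
    "Pretend launcher: one fresh 'shared memory' buffer per block, threads are just loops."
    for by in range(blocks.y):
        for bx in range(blocks.x):
            shared = torch.zeros(shar_sz)
            f(dim3(bx, by), tpb, shared, *args)

def matmul_tiled_bk(blockidx, blockdim, shared, m, n, out, h, w, k, tw):
    ms, ns = shared[:tw*tw], shared[tw*tw:]          # views: the m tile and the n tile
    for ph in range(math.ceil(k / tw)):              # loop over tiles along the k dimension
        idx = ph * tw
        # Phase 1: every "thread" copies one element of each input tile (zero-padding off the edge)
        for tr in range(blockdim.y):
            for tc in range(blockdim.x):
                r, c = blockidx.y*blockdim.y + tr, blockidx.x*blockdim.x + tc
                ms[tr*tw+tc] = m[r*k + idx+tc] if r < h and idx+tc < k else 0.
                ns[tr*tw+tc] = n[(idx+tr)*w + c] if c < w and idx+tr < k else 0.
        # Phase 2: every "thread" adds its partial dot product, reading only from shared memory
        for tr in range(blockdim.y):
            for tc in range(blockdim.x):
                r, c = blockidx.y*blockdim.y + tr, blockidx.x*blockdim.x + tc
                if r < h and c < w:
                    out[r*w+c] += sum(ms[tr*tw+i] * ns[i*tw+tc] for i in range(tw))

# Usage on tiny inputs (out must start at zero, since phase 2 accumulates into it)
h, k, w, tw = 4, 5, 4, 3
m1, m2 = torch.rand(h, k), torch.rand(k, w)
out = torch.zeros(h * w)
blk_kernel2d_shar(matmul_tiled_bk, dim3(math.ceil(w/tw), math.ceil(h/tw)), dim3(tw, tw),
                  tw*tw*2, m1.view(-1), m2.view(-1), out, h, w, k, tw)
print(torch.isclose(out.view(h, w), m1 @ m2).all())
```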

[00:31:00]

You can make sure you understand exactly what’s going on, because it’s plain Python. And so then all I did is I basically said: effectively, that code is saying, run this bit of code across all the threads, and then run this other bit of code across all the threads. So just to refactor this a little bit, I created a run_threads function that just says, okay, loop through all the threads and call some function. And so with this function available, we can now change our loop so that instead of having to do this big for loop, we can just say: run this function in every thread, and then run this function in every thread. And so then those functions just contain the two pieces. So this is now going to get a bit closer to what the CUDA code is going to look like. The CUDA code is going to have something that says: go through each tile, and then fill the shared memory using all the threads.

[00:32:05]

And then do the dot product using all the threads. Okay, so this is identical to the last one. We’ve just refactored out the loops. So it’s going to get a little bit closer to what the final CUDA code will look like. The thing that calls it is identical. And of course, therefore, the result’s the same. Are there any questions so far?

[Jeremy]

Relationship between Blocks and Tile Size

I think he asked about the relationship between blocks in CUDA and the tile size.

[Henry]

Sure. Yeah, so in CUDA, a block is, as we learned in the last one of these lectures, just a kind of conceptual thing that the CUDA programming model provides.

[00:33:02]

It’s just a bunch of numbers, basically, that are passed to your function as blockIdx. And you know that all of the threads in a block will be running on the same SM, on the same streaming multiprocessor. What you do with that information is entirely up to you. And last time we did nothing with that information. This time we’re taking advantage of it to say, OK, well, everything in a block has access to the same shared memory. So we decided that we will treat a block as something that is calculating one particular part of our output: a tile. So that’s what we called it. We just called it a tile. So a tile is just a semantic thing that we’re using. And by mapping that semantic idea of a tile to the CUDA programming model’s idea of a block, we’re basically saying, OK, we’re going to treat each block as a tile.

[00:34:06]

It’s going to allow us to use one block to calculate one tile in our output. And so therefore we’re going to have one sort of shared memory for each block which we’re mapping to each tile in our output. So you can kind of think of them as being the same thing. But they’re kind of conceptually different.

CUDA-like Approach

OK, so now we’re going to make it even more CUDA-like. Because actually CUDA code does not have a thing called run threads. It doesn’t look like this. Instead in CUDA code, there is no loop like this. But instead, all of these functions across all of these possible i0 and i1 coordinates are run at the same time.

[00:35:00]

I mean, not necessarily at the same time; they can all be at the same time, or some subset of them at the same time. Conceptually, for the programming model, you think of them as all running at the same time. To do that in Python, we have to use threads.

Python Threads

Now, in real life, Python threads don’t actually all run at the same time, except in certain situations, at least with the current version of Python. Because there’s a thing called the global interpreter lock. They actually run one after the other. But again, for the programming model, we can ignore that. So we’re just going to pretend that they actually are in parallel. So to create threads, we use Python’s threading library. It has a thread class. And so let me show you a couple of interesting things here. I’ve got a function here called g that just prints whatever you pass it. And it prints the negative of whatever you pass it. And then it prints whatever you pass it times 10.

[00:36:02]

I’m going to call g using a bunch of threads. One convenient way to create and run a bunch of threads is with a ThreadPoolExecutor. This is going to create three threads and run them at the same time, or as much as Python can handle. And so that thread pool’s map basically will run all the numbers from 1 up to num and call our function g on each. So it’ll call this three times. So if I just comment out these mysterious lines of code, you can see what it does is it runs all of them for the first thread, and then all of them for the second thread, and then all of them for the third thread. This is not going to work for us, because we want all of our threads to first complete the task of filling shared memory.

[00:37:08]

And then all of them to complete the task of doing the dot product. So we need to have a way to tell a thread to stop until all of the threads are up to this point. And in Python that thing is called a barrier. So we can create a barrier, like so, and say: until three threads have hit that barrier, stop. And so then we’re going to pass that in, the sync barrier, sb. And so it’s going to pause at the sync barrier until all the threads are here, and then pause at this sync barrier until all the threads are here. And now if you run it, you can see they all complete the first task, and then they all complete the second task, and then they all complete the third task.
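A minimal sketch of that demo (the function g and the count of three threads follow the description above):

```python
from concurrent.futures import ThreadPoolExecutor
from threading import Barrier

def g(x, sb):
    print(x)         # task 1
    sb.wait()        # wait here until all three threads have finished task 1
    print(-x)        # task 2
    sb.wait()        # wait again before anyone starts task 3
    print(x * 10)    # task 3

num = 3
sb = Barrier(num)                       # releases only once `num` threads are waiting
with ThreadPoolExecutor(num) as ex:
    list(ex.map(lambda i: g(i, sb), range(1, num + 1)))
```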

[00:38:00]

And you’ll see they don’t necessarily do it in the same order, because threads can, you know, happen in any order. And so this is the trick which is going to allow us to have a single loop, where everything in that loop first does the shared memory filling in task, and then does the dot product task. So here is our new kernel runner. As before, it goes through each block. As before, it creates our shared memory. And it’s now going to create a synchronization barrier containing the number of threads. So threads per block y times threads per block x is how many threads there will be. And then we’re going to create a whole bunch of threads. One for every y, and one for every x. If you haven’t seen this before in Python, if you have two things in a list comprehension, it just does the Cartesian product of those. This will go through everything in y and everything in x.

[00:39:02]

And so o and p will be our two coordinates. So create a new thread. The function that it’s going to call is whatever function you asked for. And the arguments are going to be the coordinates of the block, the coordinates of the thread. And then we’ll say how many threads per block, pass in the shared memory, pass in the synchronization barrier, and any arguments you requested. And so now this looks like, actually, as you’ll see, like CUDA code. We can figure out what our row and column is using exactly the approach we saw before. Although now our row and column are actually going to be based on block idx and block dim, because this is actually telling us whereabouts we are.

[00:40:07]

The shared memory is exactly the same as before. And so again, we loop through all of our tiles. And again, we set the shared memory, just like before. But you’ll notice now we don’t need two separate loops. We just do the set the shared memory piece. And then we say wait for the synchronization barrier. So remember that this is happening for every output value in the tile simultaneously. Or at least as far as the programming model is concerned, it’s simultaneously. In fact, in Python it doesn’t do a good job of actually parallelizing it. And in CUDA we don’t know for sure if they’re happening exactly the same time. But as far as the programming model is concerned, we should think of them as happening at the same time. So all of these different coordinates are running conceptually at the same time.

[00:41:04]

And so when we hit wait here, that means that all of the threads have finished running those two lines of code. And so now we know that ms and ns are filled in for that tile. And so now we can go ahead and do the dot product. And then once every thread has done its own dot product, we then need to stop and wait until they’re all done. And then once they are all done, we can go ahead and fill in the next tile’s shared memory. It’s very important to have this wait here, because if this wait wasn’t here, then some of the threads would still be going ahead and doing the dot product while others would be replacing the shared memory, and that would give you wrong answers. So you have to wait after you’ve completed the shared memory filling in, and you have to wait after you’ve completed doing the dot product.
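Here’s a rough sketch of this threaded version, refining the earlier loop-based sketch; again the names (blk_kernel2d_shar, matmul_tiled_bk) are placeholders and the real notebook differs in details:

```python
import math, torch
from threading import Thread, Barrier
from collections import namedtuple

dim3 = namedtuple('dim3', ['x', 'y', 'z'], defaults=(1, 1))

def blk_kernel2d_shar(f, blocks, tpb, shar_sz, *args):
    "One shared buffer and one barrier per block; one Python thread per (ty, tx) coordinate."
    for by in range(blocks.y):
        for bx in range(blocks.x):
            shared = torch.zeros(shar_sz)
            syncb = Barrier(tpb.y * tpb.x)
            threads = [Thread(target=f, args=(dim3(bx, by), dim3(tx, ty), tpb, shared, syncb, *args))
                       for ty in range(tpb.y) for tx in range(tpb.x)]
            for t in threads: t.start()
            for t in threads: t.join()

def matmul_tiled_bk(blockidx, threadidx, blockdim, shared, syncb, m, n, out, h, w, k, tw):
    tc, tr = threadidx.x, threadidx.y
    r = blockidx.y * blockdim.y + tr
    c = blockidx.x * blockdim.x + tc
    ms, ns = shared[:tw*tw], shared[tw*tw:]
    p = 0.
    for ph in range(math.ceil(k / tw)):
        idx = ph * tw
        # each thread copies one element of each tile, zero-padding off the edge
        ms[tr*tw+tc] = m[r*k + idx+tc] if r < h and idx+tc < k else 0.
        ns[tr*tw+tc] = n[(idx+tr)*w + c] if c < w and idx+tr < k else 0.
        syncb.wait()                  # like __syncthreads(): both tiles are fully loaded
        for i in range(tw): p += ms[tr*tw+i] * ns[i*tw+tc]
        syncb.wait()                  # everyone has finished reading before the tiles are overwritten
    if r < h and c < w: out[r*w+c] = p
```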

[00:42:01]

OK, this code’s identical to before. And again, it’s giving us the same answer. So that is a good sign. So here’s the cool thing.

Auto-generating CUDA Code

I then took this code and I passed it into ChatGPT. And I said: convert the following Python code to CUDA C. And I pointed out that you can remove these from the argument list, so we don’t need those in the argument list. I mean, obviously you could manually remove them, but I just thought if I have one prompt that always works, I don’t have to do anything manually. I said: change syncb.wait to __syncthreads. And I told it what to do for creating the shared memory; we’ll talk about all this in a moment. So I basically told it about the minor things that would have to change to convert the Python into CUDA C. And the thing it gave me worked first time, although I did do some minor cleanups. This is the code it created, after my minor cleanups.

[00:43:01]

So you’ll see now it’s converted it: it recognizes that we need float arrays for our input and output matrices, and it’s typed all of those things correctly. And in my cleanup I added a few things. So now the tile column is threadIdx.x, the tile row is threadIdx.y, and then we’ve got R and C, just like before. Now, the way CUDA does shared memory is a little bit weird. It doesn’t get passed in, just like threadIdx and blockIdx don’t get passed in. You just have to put this magic incantation in exactly one line of code in your kernel. And so here it is, here’s this one line of code. And then following that, you say what data type you want your shared memory to be, and then you say what you want to call it. And that’s an array.

[00:44:01]

So this has created something called ms, which is a pointer to the start of the shared memory that CUDA is going to create. That’s what extern __shared__ means. So ms is a pointer to the start of the shared memory. We need ns to be a pointer to the start of the second half of the shared memory. So go in tileWidth times tileWidth, because that will finish off the m part of the shared memory, that’s tileWidth by tileWidth, and the second half is the n part, which is tileWidth by tileWidth. So remember, we did this in the Python version as well. So if any of this is confusing, go back to the Python version and step through it in a debugger. So now we’ve got ms and ns as our shared memory. And then this is exactly the same as the Python version, but we’ve now got this __syncthreads(). __syncthreads() is identical to the syncb.wait: it says wait until all of the threads have finished the previous lines of code before any of them are allowed to do the next one.

[00:45:20]

So because this stuff’s built into CUDA, we don’t have to create a sync barrier object and pass it in and all that. You just have this magic line of code. So there’s quite a bit of magic in CUDA, like this extern __shared__ and like this __syncthreads, but there’s not too many pieces, and you can see we’re basically using them all here. So the next part is then to call the kernel. And when we call the kernel, we’ve got the normal triple angle brackets with blocks and threads per block, and then we pass in a third argument to the triple angle brackets, which is how much shared memory you want to create.

[00:46:00]

And so that is what’s going to be used automatically when it creates the shared memory that we get a pointer to here. That’s how much shared memory it will create. How much shared memory should you create? Well, in this case, I’ve commented out this section, so ignore that for a moment. For now, we’re just going to do the same thing: we’re just going to make the tile width 16. So the amount of shared memory we need, in bytes, is tile width times tile width for m, times 2 for n as well, times sizeof(float), because that count is in floats and we need it in bytes. So that’s the number of bytes of shared memory we need, and that’s what we pass in. Okay. So that’s basically that. And so we can then run that, and we can see that we get the same result, which is good. One thing though, which is a bit of a worry, is that our time is actually slightly worse.

[00:47:02]

It’s gone from six milliseconds to six and a half milliseconds. So we’ll talk about that in a moment. I just want to mention one other thing. In the book, they say, okay, for your size, you should write some function to calculate what size it should be, but they never say how to do that. And so in future lectures, we’ll be talking about how to think about things like this and how to design this correctly. But in this commented-out section here, you can see the basic idea. So this will work if you run it, even though it’s not necessarily optimized. You can call this special function, cudaGetDeviceProperties, passing in a structure to fill in. The & means a pointer to that, so that’s like a reference to the structure to fill in. And I think this is the device number, if I remember correctly. And it will return back a structure containing a number of things, including max threads per block.
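If you’d rather query the same sort of information from Python, Numba (which comes up later in this lecture) exposes CUDA device attributes; the attribute names below are my assumption based on Numba’s device-attribute naming convention, so check them against your Numba version:

```python
from numba import cuda

dev = cuda.get_current_device()
# Rough Python analogue of cudaGetDeviceProperties; attribute names assumed, verify in Numba's docs
print(dev.name)
print(dev.MAX_THREADS_PER_BLOCK)        # e.g. 1024 on most recent GPUs
print(dev.MAX_SHARED_MEMORY_PER_BLOCK)  # bytes of shared memory available per block
```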

[00:48:01]

And it’ll also give you shared memory per block. So you can use that to dynamically figure out threads per block, and to dynamically figure out your tile width and stuff like that. I’m not saying this is an optimized way to do any of those things. It’s just indicative, showing you how you can get all the pieces you need to calculate that. And so in later lectures, we will learn more about how to actually figure out what the optimal tile width and shared memory size would be. But for now, I’m just using 16. Okay, so this is the mystery part.

Dynamic vs. Static Shared Memory

The mystery part is: this is slower, as we saw. But I can take the exact same code and change one thing. This approach, where we tell it at runtime what size of shared memory to create, is called dynamic shared memory allocation. If we don’t use dynamic shared memory allocation, then we do that by not passing in the shared memory size here.

[00:49:09]

But instead, if we know at compile time how big we want our tiles to be, and I’ve tried both 32 and 16, you can actually create an ms of TW by TW, and an ns of TW by TW. So you can have two separate arrays. And because we’re deciding at compile time what our tile width is, this is not dynamically created. The rest of the code is the same, except now we’ve got proper two-dimensional indexing, which is nice. And with this one, this is faster. So we’ve gone down from 6 to 5. And I think, if I remember correctly, when I tried a 16 tile width, it’s a bit faster too, more like 4. Maybe we’ll have that running in the background.

[00:50:03]

Okay, let’s recompile that. Um, okay, any more questions before I move on?

[Jeremy]

So there’s one question from Gene Mesa. He asked why the size is TW times TW times 2 times sizeof(float). I think this was about the previous figure you had.

[Henry]

So in the shared memory, we need to be able to store the tile for M and the tile for N. Each one of those is TW by TW, and so therefore we need 2 times TW by TW in order to have enough room for both of those two input tiles. And then we use it here: we’ve got a pointer to the start of M.

[00:51:00]

We’ve got a pointer to the start of N. We also saw it in the Python version. The shared memory size we passed in. TW times TW times 2 because we needed the MS, M’s shared memory tile, and N’s shared memory tile. Okay, thank you for the question.

[Jeremy]

Did you find some documentation or some reference for why the dynamic shared memory version is slower, or is it supposed to be this way? I’m a little bit surprised that it’s…

[Henry]

No, it’s a total mystery to me. Um, so maybe there’s something wrong with my code. I don’t know. So like this is something I think we should try to follow up on.

[00:52:01]

Maybe some of our friends at NVIDIA can tell me the dumb thing I did. Because, you know, I’m a newbie to all this stuff, so I probably did something stupid. But yeah, I’ve looked around, I’ve read around, I’ve searched, I’ve asked ChatGPT, and nothing so far has told me. I found some other people on the internet who have asked the same thing, like, why do my dynamic and static versions have different speeds? I haven’t found answers to any of those either. So yeah, this one’s a mystery. The theory is that it definitely should not be slower; they should be identical, they should create exactly the same PTX code. So my guess is maybe I made a mistake in my code somewhere. So if anybody figures this out, I will update the YouTube description to say what the answer turned out to be. Oh, hi there.

Future Message from Jeremy

Jeremy here with a message from the future. I figured out why that code was going slowly.

[00:53:00]

And the reason is because of this tiny little bit here. The problem is that when TW, the tile width, is not known at compile time, it turns out that CUDA does not know how to create an optimized piece of code for a range of tile widths, so it falls back to the slowest possible version. I found a somewhat better solution than just supporting one constant tile width. You can skip over this if you’re not interested; it’s a bit more advanced. But basically you can use a C++ template, and you can make tile width a template parameter instead of a normal parameter. When you do this, you can now only call it with a constant tile width, which is a bit ugly. But you can actually deal with that by basically supporting some fixed number of tile widths, and it will compile a separate version of the kernel for each one.

[00:54:05]

So I’ve got it here doing an 8 and a 16 and a 32. So here I’ve just got tile width equals 16, but it’s a variable, to kind of show it’s possible, and you could replace that with some code that calculates dynamically whether you want to make it 8 or 16 or 32, or you could add additional ones as well. And then there’s a lambda. This is how C++, very ugly, does lambdas. It looks quite different to Python lambdas, as you can see. But basically this is a lambda now, which will take… so this is the function that we’re going to call, and we’ll call that function using some particular kernel. kf is the kernel function. Anyway, so lots of messiness there. It’s pretty hideous code. And I guess this is where it gets pretty complicated, when you actually want to have optimized kernels for a range of different hardware.

[00:55:00]

The good news is that at the moment, at least, any even reasonably modern NVIDIA GPU supports exactly the same amount of shared memory, so maybe all this dynamic stuff isn’t that necessary. Although having said that, I do think that you do need to change the tile width depending on the size of the matrices that you’re using. So yeah, I do think this is actually reasonably complicated to make work well in lots of different situations. And I guess this is why there’s a whole team of people at NVIDIA who work on doing this in cuBLAS and cuDNN, so we don’t have to worry about it. Anyway, I’m glad I got it figured out, and I will now return you back to our scheduled programming. All right, so now I’ve got something really exciting to show, which is that we can do everything that we’ve just seen in a different library called Numba.

[00:56:10]

Numba Library

Numba is another way of writing CUDA code; it’s a way where you actually write the CUDA code in Python. Here is the Numba code for our kernel. I haven’t tried this before, but we can actually see how long this takes to run. Okay, that’s interesting. So this one is slower still. So again, maybe I’m doing something weird. This is using the dynamic shared memory approach. So I’ve got two times tile width times tile width times, I just manually put four, which is how many bytes there are in a float. But still, it’s running at CUDA speeds, which is good, even if it’s not the full speed we were getting from the CUDA C version.

[00:57:08]

Now, why would you do this? Because, I mean, if you look at the amount of code I have here, it’s not less code than the amount that I had here, so it’s not easier to write. I’ve still got to use blockIdx, blockDim, threadIdx. All of these are now available in the cuda module, as kind of globals available here. And we can create our shared array here. If you say cuda.shared.array(0, ...), and this is actually quite a new thing in Numba, it does the dynamic approach. And so when you call the kernel, rather than using triple angle brackets, you use square brackets, passing in the blocks, threads per block, the stream number, which we haven’t learned about yet, and the dynamic shared memory size.

[00:58:03]

And so here is where it creates it with the dynamic shared memory size. You tell it that you want them to be floats. And so now we can do the exact same trick that we did before, grabbing our ms and ns. Instead of __syncthreads, it’s cuda.syncthreads(). So in some ways, I’d say, okay, this is not necessarily a big win. But there are a couple of things that do make it a big win. One I’ll show you, for instance, is, I mean, we could just do something pointless, like a times one here. There we go. So it should force it to recompile the kernel. Okay, run. There we go, done. So it took less than a second to recompile the kernel. So for some reason, which I don’t fully understand, compiling Numba CUDA kernels is way faster than compiling C and C++ CUDA kernels.

[00:59:02]

And I have asked an NVIDIA guy about this, and he was like, oh, well, it’s just how it is. There doesn’t seem to be an obvious way to make the C/C++ version faster. So I actually think this is great for doing development: I can have actual CUDA running and just change things and re-run it very fast. So I think that’s very handy. The second thing that’s very handy is that I don’t have to flatten my tensors. M and N here are being passed in as actual M and N. The only thing I’ve done to them is wrapped them with as_cuda_array, which takes zero time; it’s just changing the type, basically. So you don’t have to flatten them, and I can actually use proper indexing notation, which is convenient. That’s another nice thing. I can do things like .shape, so I don’t have to pass in the H, K, and W, which again is quite nice.
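Here’s a sketch of what such a Numba kernel and launcher can look like, under the assumptions just described (dynamic shared memory via cuda.shared.array(0, ...), a four-element launch configuration, CUDA tensors wrapped with cuda.as_cuda_array); it follows the lecture’s approach but is not the notebook’s exact code:

```python
import math, torch
from numba import cuda, float32

@cuda.jit
def matmul_tiled_numba(m, n, out, tw):
    tc, tr = cuda.threadIdx.x, cuda.threadIdx.y
    r = cuda.blockIdx.y * cuda.blockDim.y + tr
    c = cuda.blockIdx.x * cuda.blockDim.x + tc
    h, k, w = m.shape[0], m.shape[1], n.shape[1]
    shar = cuda.shared.array(0, dtype=float32)      # dynamic shared memory, sized at launch
    ms, ns = shar[:tw*tw], shar[tw*tw:2*tw*tw]      # the two tiles
    p = float32(0.)
    for ph in range((k + tw - 1) // tw):            # ceiling division over the k dimension
        idx = ph * tw
        ms[tr*tw+tc] = m[r, idx+tc] if r < h and idx+tc < k else 0.
        ns[tr*tw+tc] = n[idx+tr, c] if c < w and idx+tr < k else 0.
        cuda.syncthreads()
        for i in range(tw): p += ms[tr*tw+i] * ns[i*tw+tc]
        cuda.syncthreads()
    if r < h and c < w: out[r, c] = p

def matmul_2d_numba(m, n, tw=16):
    h, w = m.shape[0], n.shape[1]
    out = torch.zeros(h, w, dtype=m.dtype, device=m.device)   # m, n must be CUDA tensors
    blocks = (math.ceil(w / tw), math.ceil(h / tw))
    dyn_shared = 2 * tw * tw * 4                               # two float32 tiles, in bytes
    matmul_tiled_numba[blocks, (tw, tw), 0, dyn_shared](
        cuda.as_cuda_array(m), cuda.as_cuda_array(n), cuda.as_cuda_array(out), tw)
    return out
```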

[01:00:03]

So there are some conveniences. But then I’m going to tell you the most amazingly cool thing, which is that the Python thread thing we created back here, that kind of simulates threads and simulates CUDA in Python, is fully built in to Numba.

CUDA Simulator

So everything that we kind of recreated from scratch here actually already exists. And so in Numba, to use it, if you Google for Numba CUDA simulator, you’ll see that if you set the environment variable NUMBA_ENABLE_CUDASIM to 1, then that enables the CUDA simulator. The code is executed as normal, except that it’s actually run on the CPU as pure Python code, just like ours was.

[01:01:02]

So you can, for example, set a breakpoint or print stuff directly from Python. Now notice, because this is not running CUDA, it’s going to be slow; it’s going to be exactly as slow as our manual Python version, because this basically is their manual Python version. So you still want to use much smaller subsets of your data. But this is a great way, in my opinion, to do real-world CUDA development: do it in Numba, do it all with NUMBA_ENABLE_CUDASIM set to 1, with small amounts of data, until everything works. And, by the way, you have to set that environment variable before you import Numba. And if you’re using a notebook, you’d have to restart the kernel, change the environment variable, and then re-import Numba.
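In code, that workflow looks roughly like this:

```python
import os
os.environ['NUMBA_ENABLE_CUDASIM'] = '1'   # must be set *before* Numba is imported

from numba import cuda   # @cuda.jit kernels now run as plain Python on the CPU,
                         # so you can set breakpoints or print() inside them
```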

[01:02:08]

And then you can set it to 0, and now the exact same code will be running on the GPU. And I’ve tested this: if you then take your code and paste it into ChatGPT and say, please convert this into CUDA C code, for me at least, it did it perfectly correctly the first time. So I think this is a really useful way to combine all the stuff we’ve learned about from first principles, and we’ve done it all from scratch, so we understand how it all works. But now, to implement it in practice, maybe the easiest way to do that is actually to use Numba. Now, of course, you don’t even need to convert it to C or C++; you could just leave it in Numba. The challenge is that from a deployment point of view, it might be a bit more tricky. With PyTorch, if you use our load_inline / load_cuda approach, the documentation explains how you can pre-compile that and actually provide a pip- or conda-installable package that people can use right away without having any of the CUDA development toolkit installed.

[01:03:19]

For Numba, that’s not true. Having said that, if you do install Numba from conda, it automatically installs all the stuff you need for the toolkit, so maybe that’s okay. So anyway, these are some pros and cons to think about, you know.

Pros and Cons of Numba

So maybe you just use Numba as is. Maybe it’s a little bit slower, so maybe that’ll be a problem. Or maybe you auto-convert it to CUDA C and actually use that in practice. Yeah. So I think that basically covers everything. Any more questions, or anything anybody wanted to add?

[01:04:00]

[Jeremy]

Discussion and Q&A

From my side, first of all, I must say this was a super fantastic session. I learned a lot. And I didn’t expect that you would go so deep into the matmul stuff. Also, I think this shows so many elements of CUDA development, starting from having these single-character variables that you normally see, which maybe, when people see it for the first time, makes it a little bit inaccessible at first glance, because you just see these variables being added and multiplied, whatever.

[Horace]

The magic happens. That’s what you showed: starting from drawing it all out, what the memory and the structures basically look like and what you really need to do, and then referencing back to that. Because normally, with the first approach, you already get something wrong.

[01:05:05]

[Henry]

Yes, marvelous. Thank you.

[Horace]

There are also a few things that stood out for me. I think if more people are interested in how to ship Numba kernels, that’s something we can certainly take a look at. As far as I know, I think Numba did have an ahead-of-time compilation mode, which should make stuff easier. Because it’s all Python, you can ship it, you just have to eat the JIT compilation cost.

[Henry]

Yeah, the AOT ahead-of-time compilation is deprecated as of, what is this, February 2024. They say they’re going to replace it with something newer and better, but they haven’t said what that is yet. So that’s currently an outstanding question. They did say they won’t remove the previous AOT approach until the new approach is working and everything. So yeah, hopefully by the time people watch this, if people watch this video on YouTube in the future, maybe this will all be fully resolved.

[01:06:02]

[Jeremy]

Comparison to cuBLAS and PyTorch

So we have a question here from Jonas, who wants to know if you have checked how it compares to cuBLAS, to the optimized CUDA library, or maybe also to the PyTorch matrix multiplication as a baseline.

[Henry]

I haven’t, because I’m not a CUDA expert. I actually don’t know how to optimize all these shared memory sizes and tiles and blah, blah, blah. So, you know, I know that stuff like cuBLAS has a whole lot of that. Actually, one of the things you might have noticed is that I changed my input matrix sizes from last time. Last time I used the MNIST data and a kind of pretend set of weights for MNIST, and so the output was 50,000 by 10. And that really long, narrow matrix is particularly hard to optimize, because with a tile width of 32, for example, most of it’s padding.

[01:07:04]

So I actually used kind of simpler shapes, to make this a bit easier for me to think about. So I think it’ll be a fun exercise as we go further to see if we can figure out, for that one-layer MNIST model, whether we can create something that is close in performance to what cuBLAS or cuDNN does. Because, yeah, I found it quite tricky to think about how I would do all that automatically.

Future Directions

There are things like TVM, which, you know, maybe one day we can look at together. I know Thomas Viehmann says he’s used that. Yeah, so maybe he can even help us there. It’s something which can kind of automatically optimize these and a lot more details for you. And then also I know that sometime, hopefully in the next month or so, so sometime around late February or March 2024, Mojo GPU should be becoming available, at least in a preview form.

[01:08:08]

And that has an auto-tune functionality which might help us to automatically find the best parameters as well. But yeah, for now I didn’t even bother checking because I suspected I was going to be quite embarrassed at how much further there is to go.

[Jeremy]

Community Collaboration

I think this is like a super opportunity for the community, because you have this notebook which everybody can fork and play around with, try out different stuff, make performance measurements. And I’m pretty sure that somebody will make reports and experiment with it, and maybe we can look at what our current state is, and see if we come closer to, yeah, the state of the art for this matrix multiplication. And I think the session today also laid the foundation for other stuff, especially tensor core technology from NVIDIA, for example, which is specifically in the hardware: further hardware features for assisting matrix multiplications.

[01:09:17]

Hardware Features

Yeah, so that’s something we can look into in the future, maybe.

[Henry]

Okay. So for what it’s worth, I think getting something that’s cuBLAS-competitive is probably going to be very difficult, at least on basically the server SKUs of things, like A100s.

[Horace]

I suspect that’s not as true for consumer GPUs.

[Henry]

So that’s potentially a place where people could beat this stuff, because there’s going to be less attention from NVIDIA. Good point.

[Jeremy]

PyTorch Matrix Multiplication

So Mark G. Misa wants to know what PyTorch is currently using for matrix multiplications. Do you know? Yeah, so PyTorch mostly still uses cuBLAS.

[01:10:02]

There is something experimental: if you use torch.compile, there’s a flag in torch._inductor.config for Triton matmuls. I think it’s off by default, but that’s something we’re potentially interested in exploring. But I think as of today, it’s mostly still cuBLAS.

[Henry]

The way you can tell, by the way, is if you launch the PyTorch profiler, there are function names and a lot of specific signatures that tell you, okay, is it using cuBLAS? Is it using tensor cores? So you can look at a profile trace as well, but you don’t really need to think about it.

Importance of Fast Compilation

And Jeremy, what you said regarding the speed of compilation, for me personally, this is super important when experimenting with things. I was trying to optimize for this, and I think it’s a big advantage if this works really much faster than running NVCC.

[01:11:00]

Yeah.

[Jeremy]

Otherwise it’s like waiting for the next result until you see the figures?

[Henry]

Exactly. Yeah, you need a quick iteration loop. So I think between the CUDA simulator and the fast Numba CUDA JIT, it can make life a lot better.

ChatGPT Capabilities

So I think there were two more general questions regarding ChatGPT use. I think one was whether ChatGPT could fuse multiple kernels, or is that where the limits basically are for what ChatGPT can do? Let’s try it. That’d be an interesting thing for people to experiment with, wouldn’t it? See how far we can go. I think in general people tend to underestimate what ChatGPT can do. So yeah, why don’t you try some things that you suspect might be a bit too hard for it, and we’ll find out where the limits are.

[01:12:02]

[Jeremy]

Shared Memory in ChatGPT

So for example, this __shared__ memory thing, was this added automatically by ChatGPT?

[Henry]

I put that prompt in. Well, so when I converted the Numba version into CUDA, it automatically changed the Numba shared memory call. In my prompt, I had something saying replace syncb.wait with __syncthreads. I didn’t try doing it without that; maybe it would have guessed.

[Jeremy]

Future of Software Development

And there was another question regarding the general capabilities of ChatGPT for coding. What’s your expectation for developers in the future: will we be replaced? Will we just specify things via prompts?

[Henry]

Personally, I haven’t found it at all useful for anything remotely novel, you know? I’ve found it really useful for using languages I’m not that familiar with, or frameworks I’m not that familiar with.

[01:13:08]

And it kind of gets me going, you know, or kind of calling some API. I find it really good for that. But for doing anything at all algorithm-related, which is anything other than just replicating a well-known algorithm, I’ve found it terrible. So at least for the kind of work I do, which is research-oriented, so most of what I write is pretty new, I haven’t found it at all useful. So at least I think my work’s safe for a while. I don’t know about yours. Yeah, I’m sure it depends on how quickly we go further towards AGI.

[Jeremy]

But I think it will play a different role in the future for software development. That’s for sure. We have another question regarding Numba and how it compares to Triton.

Comparison to Triton

I don’t know, do you have experience with Triton already?

[Henry]

Yeah, a little bit.

[01:14:00]

I’m not an expert; I know you guys know a lot more than me. So Triton’s different in that in Triton, you can have a matrix multiplication, the @ operator, inside your kernel. And Triton’s really clever at optimizing these much more sophisticated kinds of things. So Numba’s a lot more basic, really. It’s just a mapping of the CUDA C concepts directly to Python concepts. You know, Triton is fairly recent, literally a PhD thesis artifact, so it’s doing something a lot more clever. I mean, the similarities are: Triton works with a decorator, it converts Python code to GPU code.

[01:15:00]

I think they’re good for different things, you know, because Triton is somewhat limited as well. It’s very, very smart, but it’s not a mapping of the entire CUDA programming model. So for example, I know at Meta, when you guys, you know, Horace and those guys, did the gpt-fast thing and wanted to do four-bit quantization, you found that you couldn’t do it in Triton. And that’s because Triton doesn’t have any way to express that at the moment. So yeah, I think they both have their place. Do you guys feel the same way?

[Horace]

Yeah, so we are going to have Charles, who was the author of the gpt-fast kernel, give us a talk in two weeks. So he’s going to tell us first-hand what happened. I think when people ask this question, the motivation is sort of: can you avoid learning CUDA?

[Henry]

Avoiding Learning CUDA

And the answer to that is, that’s a negative. Same as today, yeah.

[Horace]

You sort of have to learn both, and, you know, Triton will just be slightly easier for shipping some sort of useful kernel.

[01:16:03]

But like, you’re going to sort of run into the limits of the language.

[Henry]

Yeah, and I’m not even sure it’s easier. I’m not even sure it’s easier to use Triton. I feel like Triton is more complex in many ways. So once you know CUDA, then you can find the places where Triton can help. I think trying to use Triton effectively, if you didn’t know CUDA, would be an even harder path. Especially now that we’ve kind of figured out these ways of doing iterative, notebook-based CUDA development, which takes away one of the big benefits of Triton, which is just that it’s a decorator and it’s Python and stuff. So, yeah.

[Jeremy]

Triton Experience

So I like Triton very much. I always joke that maybe OpenAI has a fixed version of it internally that the kernel team can write their kernels with, because I ran so often into strange and unexpected things.

[01:17:01]

Yeah, but also I think the iteration speed is similar to Numba. That’s a good positive point for Triton.

Conclusion

Okay, so I think then this was a wonderful session today. I really thank you so much, Henry.

[Henry]

Thank you.

[Jeremy]

Thank you so much for the opportunity to work on this. And yeah, fantastic, really deep dive into matrix multiplication.

[Henry]

Bye, everybody. Thanks for joining.