Multiplying matrices in CUDA using shared memory tiled technique

In the below homework code I need to find out what more boundary conditions are required for the tiled matrix multiplication to work. Please help me out I have tried for a week to find out what the problem is ?

#include    <wb.h>

#define TILE_WIDTH 16


__global__ void matrixMultiplyShared(float * A, float * B, float * C,

                     int numARows, int numAColumns,
                     int numBRows, int numBColumns,
                     int numCRows, int numCColumns) {

//@@ Insert code to implement matrix multiplication here
//@@ You have to use shared memory for this MP
  __shared__ float s_A[TILE_WIDTH][TILE_WIDTH];
  __shared__ float s_B[TILE_WIDTH][TILE_WIDTH];

  int tx= threadIdx.x; int ty = threadIdx.y;
  int bx= blockIdx.x ; int by = blockIdx.y ;

  int Row = by*TILE_WIDTH + ty;
  int Col = bx*TILE_WIDTH + tx;



   if((Row < numARows  ) && (Col < numBColumns )) {


 float Pvalue =0.0;
  for (int m = 0; m < (numAColumns-1)/TILE_WIDTH+1; ++m) {

    if((Row < numARows) && ( (m*TILE_WIDTH+tx) < numAColumns)) {

      s_A[ty][tx] = A[Row*numAColumns +( m*TILE_WIDTH+tx)];
    }
    else
    {
      s_A[ty][tx] = 0.0;
    }
    if(((m*TILE_WIDTH+ty) < numBRows) && (Col < numBColumns)) {

      s_B[ty][tx] = B[(m*TILE_WIDTH+ty)*numBColumns+Col];
    }
    else
    {
      s_B[ty][tx] = 0.0;
    }
    __syncthreads();

   if((Row < numARows  ) && (Col < numBColumns )) {
    for (int k = 0; k < TILE_WIDTH; ++k)
      {

         Pvalue += s_A[ty][k] * s_B[k][tx];

      }
    __syncthreads();

  }
  }
    if((Row < numARows  ) && (Col < numBColumns )) { 
       C[Row*numCColumns+Col] = Pvalue;
    }
}
else
return;
}

Answers


The problem is in the condition for entering into the loop for loading the TILE, that is, if((Row < numARows ) && (Col < numBColumns )), and also next time again checking the same condition while doing the actual computation for the resulting element for each loaded TILE, only the last condition, that is while writing to the global is enough. you can find a detailed implementation for reference here


Need Your Help

How can I place two images side-by-side with Asciidoctor?

asciidoc asciidoctor

I'm trying to place two images side-by-side, and ideally center the whole block in an Asciidoctor project. The code below works in the HTML5 output, but stacks the images when rendered to a PDF.

c++ getline doesn't get the input

c++ codeblocks cin getline

i am trying to input a line and then an integer then a line again however when it the last cin gets the line it and i press enter it crashes or outputs randomly to the infinity. whats wrong?