<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>GPU Magic</title>
	<atom:link href="http://gpumagic.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://gpumagic.wordpress.com</link>
	<description>Making GPUs do magical things!</description>
	<lastBuildDate>Tue, 11 Oct 2011 21:02:43 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='gpumagic.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://1.gravatar.com/blavatar/12fa1b3e2cff5d07a9a0e8b6c4938d6a?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>GPU Magic</title>
		<link>http://gpumagic.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://gpumagic.wordpress.com/osd.xml" title="GPU Magic" />
	<atom:link rel='hub' href='http://gpumagic.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Readings in GPU: &#8220;Understanding performance bottlenecks in numerical kernels on GPUs&#8221; by Vasily Volkov</title>
		<link>http://gpumagic.wordpress.com/2011/01/02/readings-in-gpu-understanding-performance-bottlenecks-in-numerical-kernels-on-gpus-by-vasily-volkov/</link>
		<comments>http://gpumagic.wordpress.com/2011/01/02/readings-in-gpu-understanding-performance-bottlenecks-in-numerical-kernels-on-gpus-by-vasily-volkov/#comments</comments>
		<pubDate>Sun, 02 Jan 2011 01:20:59 +0000</pubDate>
		<dc:creator>ajtsheppard</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://gpumagic.wordpress.com/?p=45</guid>
		<description><![CDATA[Volkov&#8217;s article is made up of a set of presentation slides, 53 in number, and specifically deals with performance bottlenecks for LU factorization and matrix multiplication. To the extent that these computations share numerical operations in common with other compute intensive tasks, the advice given in the article provides a general overview of performance bottlenecks [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gpumagic.wordpress.com&amp;blog=7436101&amp;post=45&amp;subd=gpumagic&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Volkov&#8217;s article is a set of 53 presentation slides that deals specifically with performance bottlenecks in LU factorization and matrix multiplication. To the extent that these computations share numerical operations with other compute-intensive tasks, the advice in the article provides a general overview of performance bottlenecks on GPUs.</p>
<p>Of special note is that the article deals only with <span style="text-decoration:underline;">single precision</span> computational bottlenecks. Even on Nvidia&#8217;s latest generation of GPUs (the &#8220;Fermi&#8221; series), double precision performance lags single precision by a factor of two or so. That said, much of what Volkov has to say is equally applicable to double precision.</p>
<p>Volkov&#8217;s motivation for the article (if any is ever needed for seeking improved computational performance) is a set of comparative benchmarks from 2007, in which an Intel Core2 Quad 2.4GHz and a GeForce 8800 GTX are compared for matrix multiplication and LU factorization. The comparison is between Intel&#8217;s MKL (Math Kernel Library) and Nvidia&#8217;s CUBLAS (CUDA Basic Linear Algebra Subroutines). On a raw computational basis, the GPU has a roughly 5:1 advantage over the CPU. In practice, Volkov shows the advantage to be somewhere between 2:1 and no advantage at all! A rather disappointing result, but one that motivated Volkov to dig deeper into the problem.</p>
<p>GPUs obtain their raw computational power advantage over CPUs by having vastly more ALUs (Arithmetic Logic Units). However, this advantage in raw compute power can only be realized if the ALUs are all kept busy. Failure to fully occupy the ALUs will throw away this raw compute advantage; but to fully occupy the ALUs with a given computational task requires a good understanding of the GPU architecture.</p>
<p>Using the coding techniques suggested by Volkov yields impressive performance improvements over code written in a way less &#8220;sympathetic&#8221; to the low-level architecture of the GPU. Moreover, the techniques appear to yield a higher fraction of theoretical peak performance over a range of GPU architectures from old to new (GeForce 8600 GTS, 8800 GTX, 9800 GTX, and GTX 280). The net result is about a doubling of performance over Nvidia&#8217;s CUBLAS code. And the same techniques suggested by Volkov have also yielded performance gains for other numerical kernels, such as the FFT (Fast Fourier Transform).</p>
<p>Some of the bottlenecks pinpointed by Volkov will no doubt be improved in future GPU architectures. That will go some way to relieving the application programmer of the burden of optimizing performance. But anyone wanting to make their computations truly fly on GPUs would do well to be mindful of how instructions and memory interact on GPUs at a very low level, as discussed in Volkov&#8217;s article.</p>
<p>Volkov&#8217;s BLAS algorithms are now incorporated into newer versions of CUBLAS.</p>
<p>Key takeaways:</p>
<ul>
<li>Raw computational power of a GPU scales with the number of cores.</li>
<li>Memory bandwidth on a GPU scales with the number of memory controllers.</li>
<li>GPUs have no scalar capability in that scalar and pointer operations use full vector length.</li>
<li>The GPU supports variable vector length, up to a maximum of 512. Often, maximum performance can be achieved with shorter vectors; 64 elements, for example.</li>
<li>There is a hard ceiling of 2/3 of peak raw performance when using an operand with shared memory. Volkov contends that this is an inherent bottleneck in the GPU architecture.</li>
<li>Row pivoting in column-major layout on GPU is slow.</li>
<li>Don&#8217;t run small tasks (vectors and matrices with &#8220;N&#8221; and &#8220;M&#8221; dimensions less than several thousand) on GPU because the overhead isn&#8217;t worth it.</li>
<li>Large matrix operations can benefit significantly if spread across multiple GPUs.</li>
<li>Advice: Keep as much data as possible in registers, but bear in mind that doing so only helps with vector operations.</li>
<li>Advice: Use shared memory, which is smaller and slower than register memory, as a communication device only. It is not only an issue of memory access (fetch and store) because some instructions run slower (require more clock cycles) when using shared memory.</li>
<li>Advice: Use shorter vectors, not longer, especially for scalar and pointer operations. If necessary, divide longer vectors and their operations into shorter ones in application code.</li>
<li>Advice: It is sometimes the case that it is better to reduce memory traffic by recomputing results rather than storing and retrieving them from memory.</li>
<li>Advice: Try fast algorithms that might fail occasionally; check for failure, and recompute if needed using a slower algorithm.</li>
<li>Advice: Use heterogeneous algorithms (of the type that can run on either the CPU or GPU) and make the switch to GPU only for problems that exceed a certain size, or based on other problem characteristics.</li>
<li>My advice: Question all assumptions, and all coding advice! If the cost of doing experiments with your code is low, do them.</li>
</ul>
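<p>Two of the pieces of advice above, switching to the GPU only above a certain problem size and trying a fast algorithm that is checked and redone if it fails, can be sketched in a few lines. This is only an illustration under stated assumptions: the crossover size <tt>GPU_MIN_DIM</tt> and every function name are hypothetical, and the 2x2 solver without row exchange merely stands in for "a fast algorithm that might fail occasionally"; none of it is taken from Volkov&#8217;s slides.</p>

```python
import math

# Assumed crossover size; the text above only says "several thousand".
GPU_MIN_DIM = 2000


def choose_backend(n, m, min_dim=GPU_MIN_DIM):
    """Return 'gpu' only when both dimensions reach the crossover size."""
    return "gpu" if min(n, m) >= min_dim else "cpu"


def solve2_no_pivot(a, b):
    """Fast path: 2x2 elimination with no row exchange; None on a zero pivot."""
    (a11, a12), (a21, a22) = a
    if a11 == 0.0:
        return None
    factor = a21 / a11
    d = a22 - factor * a12
    if d == 0.0:
        return None
    y = (b[1] - factor * b[0]) / d
    return [(b[0] - a12 * y) / a11, y]


def solve2_pivot(a, b):
    """Slow path: put the row with the larger leading entry first, then eliminate."""
    rows = sorted(zip(a, b), key=lambda row: abs(row[0][0]), reverse=True)
    return solve2_no_pivot([rows[0][0], rows[1][0]], [rows[0][1], rows[1][1]])


def residual_ok(a, b, x, tol=1e-9):
    """Accept x only if A times x reproduces b to within tol."""
    if x is None:
        return False
    return all(
        math.isclose(row[0] * x[0] + row[1] * x[1], rhs, abs_tol=tol)
        for row, rhs in zip(a, b)
    )


def solve2(a, b):
    """Try the fast path, verify it, and recompute with pivoting if it failed."""
    x = solve2_no_pivot(a, b)
    if residual_ok(a, b, x):
        return x, "fast"
    return solve2_pivot(a, b), "slow"
```

<p>For example, <tt>solve2([[0.0, 1.0], [1.0, 1.0]], [2.0, 3.0])</tt> hits a zero pivot on the fast path and falls back, returning <tt>([1.0, 2.0], "slow")</tt>. The right crossover size for <tt>choose_backend()</tt> is something to measure on your own hardware, not to assume.</p>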
<p>Notes:</p>
<ul>
<li>[1] &#8220;Understanding performance bottlenecks in numerical kernels on GPUs&#8221; by Vasily Volkov, May 21, 2010, can be found here: <a href="http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/volkov2010-NTU1.pdf">http://www.math.ntu.edu.tw/~wwang/mtxcomp2010/download/volkov2010-NTU1.pdf</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://gpumagic.wordpress.com/2011/01/02/readings-in-gpu-understanding-performance-bottlenecks-in-numerical-kernels-on-gpus-by-vasily-volkov/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/8be1ee7695be5ad0e1b23a1b202cbb82?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ajtsheppard</media:title>
		</media:content>
	</item>
		<item>
		<title>Readings in GPU: &#8220;CUBLAS Library&#8221; by Nvidia</title>
		<link>http://gpumagic.wordpress.com/2010/11/20/readings-in-gpu-cublas-library-by-nvidia/</link>
		<comments>http://gpumagic.wordpress.com/2010/11/20/readings-in-gpu-cublas-library-by-nvidia/#comments</comments>
		<pubDate>Sat, 20 Nov 2010 10:24:31 +0000</pubDate>
		<dc:creator>ajtsheppard</dc:creator>
				<category><![CDATA[CUBLAS]]></category>
		<category><![CDATA[CUDA]]></category>
		<category><![CDATA[Nvidia]]></category>
		<category><![CDATA[Readings in GPU]]></category>

		<guid isPermaLink="false">http://gpumagic.wordpress.com/?p=18</guid>
		<description><![CDATA[BLAS stands for &#8220;Basic Linear Algebra Subprograms&#8221; and is a collection of functions for linear algebra operations with vectors and matrices. It provides many of the basic building blocks for other numerical libraries, such as LAPACK (&#8220;Linear Algebra PACKage&#8221;). The &#8220;CUBLAS Library&#8221; [1] is Nvidia&#8217;s GPU/CUDA implementation (with contributions from Vasily Volkov, Davide Barbieri and from [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gpumagic.wordpress.com&amp;blog=7436101&amp;post=18&amp;subd=gpumagic&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>BLAS stands for &#8220;Basic Linear Algebra Subprograms&#8221; and is a collection of functions for linear algebra operations with vectors and matrices. It provides many of the basic building blocks for other numerical libraries, such as LAPACK (&#8220;Linear Algebra PACKage&#8221;). The &#8220;CUBLAS Library&#8221; [1] is Nvidia&#8217;s GPU/CUDA implementation (with contributions from Vasily Volkov, Davide Barbieri and from the University of Tennessee) of the BLAS library.</p>
<p>BLAS is divided into three levels: 1) BLAS1 for vector operations, 2) BLAS2 for matrix-vector operations, and 3) BLAS3 for matrix-matrix operations. For each level, functions are available in single precision, double precision, and complex (single and double precision) variants.</p>
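<p>To make the three levels concrete, here is a minimal pure-Python sketch with one representative operation per level. The names <tt>axpy</tt>, <tt>gemv</tt> and <tt>gemm</tt> echo the usual BLAS routine families, but the signatures are simplified for illustration (the real routines take further arguments such as scaling factors, transpose flags and leading dimensions) and this is not how CUBLAS is called.</p>

```python
def axpy(alpha, x, y):
    """Level 1 (vector-vector): return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def gemv(a, x):
    """Level 2 (matrix-vector): return A times x for a list-of-rows matrix A."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def gemm(a, b):
    """Level 3 (matrix-matrix): return A times B for list-of-rows matrices."""
    cols = list(zip(*b))  # columns of B
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]
```

<p>The levels differ in how much arithmetic they do per element of data. For vectors and square matrices of dimension n, level 1 does on the order of n operations on n values, level 2 does n squared operations on n squared values, and level 3 does n cubed operations on n squared values, which is why matrix-matrix routines have the most to gain from a device with many ALUs.</p>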
<p>Like many GPU libraries, using CUBLAS requires an understanding of what&#8217;s going on at the hardware level. Because CPU memory is separate from GPU memory, as a programmer you are in the position of having to manage objects in two memory spaces. For this reason, CUBLAS provides some &#8220;helper functions&#8221; to simplify use of the library. These helper functions, such as <tt>cublasAlloc()</tt> to allocate memory on the GPU device, and <tt>cublasSetVector()</tt> to move a vector from CPU memory to GPU memory, all return a <tt>cublasStatus</tt> code. <tt>cublasStatus</tt> is defined as <tt>unsigned int</tt> in the header file <tt>cublas.h</tt> together with symbolic names for the success and failure return codes, such as <tt>CUBLAS_STATUS_SUCCESS</tt> and <tt>CUBLAS_STATUS_ARCH_MISMATCH</tt>. The library must be initialized before use by calling <tt>cublasInit()</tt>, and shut down in an orderly fashion after use by calling <tt>cublasShutdown()</tt>. As the programmer is largely responsible for managing memory on both the CPU host and GPU device, care with allocating and freeing memory is essential.</p>
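<p>The init, check, shutdown discipline described above can be sketched independently of the library itself. In the sketch below, <tt>check()</tt>, <tt>CublasError</tt> and <tt>cublas_session()</tt> are hypothetical helpers, the <tt>init</tt> and <tt>shutdown</tt> arguments are stand-ins for whatever binding you use to reach <tt>cublasInit()</tt> and <tt>cublasShutdown()</tt>, and the numeric status values are what I believe the legacy <tt>cublas.h</tt> defines, so verify them against your own copy of the header.</p>

```python
from contextlib import contextmanager

# Assumed values; check them against the cublas.h shipped with your toolkit.
CUBLAS_STATUS = {
    0x0: "CUBLAS_STATUS_SUCCESS",
    0x1: "CUBLAS_STATUS_NOT_INITIALIZED",
    0x3: "CUBLAS_STATUS_ALLOC_FAILED",
    0x7: "CUBLAS_STATUS_INVALID_VALUE",
    0x8: "CUBLAS_STATUS_ARCH_MISMATCH",
}


class CublasError(RuntimeError):
    """Hypothetical exception raised for any non-success status code."""


def check(status, call_name):
    """Raise if a helper function returned anything other than success."""
    if status != 0:
        name = CUBLAS_STATUS.get(status, hex(status))
        raise CublasError(f"{call_name} failed with {name}")


@contextmanager
def cublas_session(init, shutdown):
    """Pair init with shutdown so the library is closed even after an error."""
    check(init(), "cublasInit")
    try:
        yield
    finally:
        # Status deliberately ignored so it cannot mask an earlier error.
        shutdown()
```

<p>Wrapping every call that returns a <tt>cublasStatus</tt> in something like <tt>check()</tt> turns a silently ignored failure code into an error at the point where it happened, and the session wrapper guarantees that the shutdown call runs on every exit path.</p>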
<p>A careful perusal of the <tt>cublas.h</tt> header file is informative in itself, but is essential if you want to really understand how to best call the library from your own code. For example, if you want to call CUBLAS from C# using P/Invoke you will need to pay careful attention to function signatures in <tt>cublas.h</tt>, as these two examples illustrate.</p>
<p><pre class="brush: csharp;">
// cublas.h: cublasInit() function signature.
cublasStatus CUBLASAPI cublasInit (void);

// C# wrapper for cublasInit()
[DllImport(&quot;cublas.dll&quot;,CallingConvention=CallingConvention.StdCall)]
public static extern cublasStatus cublasInit();
</pre></p>
<p><pre class="brush: csharp;">
// cublas.h: cublasAlloc() function signature.
cublasStatus CUBLASAPI cublasAlloc (int n, int elemSize, void **devicePtr);

// C# wrapper for cublasAlloc()
[DllImport(&quot;cublas.dll&quot;, CallingConvention = CallingConvention.StdCall)]
public static extern cublasStatus cublasAlloc(int n, int elemSize, out IntPtr devicePtr);
</pre></p>
<p>Because BLAS was originally written in FORTRAN, it uses vector and matrix indexing that starts at one, in contrast to typical C/C++ libraries, where indexing starts at zero. CUBLAS is a C/C++ based library API, but the documentation does discuss how to call CUBLAS functions from FORTRAN. Another wrinkle associated with CUBLAS&#8217; FORTRAN past is that matrices are stored in memory in &#8220;column-major&#8221; order, the opposite of the &#8220;row-major&#8221; order of C/C++. When passing matrices to CUBLAS functions be sure to pass them in column-major order. These choices, indexing that starts at 1 and column-major order, were made so that CUBLAS has maximum compatibility with existing FORTRAN environments.</p>
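<p>The difference between the two conventions comes down to how an (i, j) pair maps to an offset in a flat buffer. The following sketch shows both mappings and a helper that reorders a list-of-rows matrix into column-major order. The function names are my own, and <tt>ld</tt> (the leading dimension) is assumed here to equal the number of rows.</p>

```python
def idx_1based_col_major(i, j, ld):
    """Flat offset of element (i, j), 1-based, in column-major storage."""
    return (j - 1) * ld + (i - 1)


def idx_0based_row_major(i, j, ncols):
    """Flat offset of element (i, j), 0-based, in row-major (C-style) storage."""
    return i * ncols + j


def to_column_major(rows):
    """Flatten a list-of-rows matrix column by column."""
    return [rows[i][j] for j in range(len(rows[0])) for i in range(len(rows))]
```

<p>For the 2x3 matrix <tt>[[1, 2, 3], [4, 5, 6]]</tt>, <tt>to_column_major()</tt> gives <tt>[1, 4, 2, 5, 3, 6]</tt>. The element in row 2, column 3 (value 6) sits at offset <tt>idx_1based_col_major(2, 3, 2)</tt>, which is 5, in that buffer. In the row-major buffer <tt>[1, 2, 3, 4, 5, 6]</tt> the same element also happens to be at offset 5, via <tt>idx_0based_row_major(1, 2, 3)</tt>, but most other elements land at different offsets in the two layouts.</p>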
<p>The library is distributed under the &#8220;Modified Berkeley Software Distribution License&#8221;.</p>
<p>Key takeaways:</p>
<ul>
<li>Care must be taken when using the CUBLAS library because vector and matrix indexing starts at 1 and matrices are stored in column-major order.</li>
<li>CUBLAS is a numerical library optimized for GPU using CUDA that is distributed under a liberal (in the sense of being largely unfettered) redistribution license.</li>
</ul>
<p>Notes:</p>
<ul>
<li>[1] The &#8220;CUBLAS library&#8221; is distributed as part of Nvidia&#8217;s &#8220;GPU Computing Toolkit&#8221; and includes the binary, header file (cublas.h) and documentation (CUBLAS_Library.pdf).</li>
<li>[2] You can learn more about BLAS on Wikipedia: <a title="Basic Linear Algebra Subprograms" href="http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms">Basic Linear Algebra Subprograms</a>. And over at Netlib: <a title="BLAS (Basic Linear Algebra Subprograms)" href="http://www.netlib.org/blas/">BLAS (Basic Linear Algebra Subprograms)</a>.</li>
<li>[3] Optimized versions of BLAS are also available for Intel and AMD CPUs.</li>
<li>[4] Sparse matrix versions of BLAS also exist.</li>
<li>[5] You can learn about the history of BLAS from the &#8220;oral histories&#8221; section of the Society for Industrial and Applied Mathematics (SIAM) website: <a title="The History of Numerical Analysis and Scientific Computing" href="http://history.siam.org/oralhistories.htm">The History of Numerical Analysis and Scientific Computing</a>. Jack Dongarra and Charles Lawson were key contributors in the development of BLAS.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://gpumagic.wordpress.com/2010/11/20/readings-in-gpu-cublas-library-by-nvidia/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/8be1ee7695be5ad0e1b23a1b202cbb82?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ajtsheppard</media:title>
		</media:content>
	</item>
		<item>
		<title>Hello world!</title>
		<link>http://gpumagic.wordpress.com/2009/04/20/hello-world/</link>
		<comments>http://gpumagic.wordpress.com/2009/04/20/hello-world/#comments</comments>
		<pubDate>Mon, 20 Apr 2009 20:53:45 +0000</pubDate>
		<dc:creator>ajtsheppard</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false"></guid>
		<description><![CDATA[Hi, my name is Andrew Sheppard and welcome to my blog on GPU programming. I have a sister blog called “Multicore Magic” that you might also want to take a look at. Here you will find my thoughts and experiments with programming with GPUs. There is a bias towards application of GPUs in finance, but [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=gpumagic.wordpress.com&amp;blog=7436101&amp;post=1&amp;subd=gpumagic&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>Hi, my name is Andrew Sheppard and welcome to my blog on GPU programming. I have a sister blog called “Multicore Magic” that you might also want to take a look at.</p>
<p>Here you will find my thoughts and experiments with programming with GPUs. There is a bias towards application of GPUs in finance, but that’s simply because that’s where I do my work.</p>
]]></content:encoded>
			<wfw:commentRss>http://gpumagic.wordpress.com/2009/04/20/hello-world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/8be1ee7695be5ad0e1b23a1b202cbb82?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">ajtsheppard</media:title>
		</media:content>
	</item>
	</channel>
</rss>
