Thursday, December 1, 2011

GPU-based SSSP

http://www.ijcaonline.org/volume10/number10/pxc3871992.pdf
Performance Improvement in Large Graph Algorithms on GPU using CUDA: An Overview, 2010

Basic operations on graphs with millions of vertices are common in many applications, and faster execution of such operations is essential to reducing overall computation time. Today's graphics processing units (GPUs) offer high computation power at a low price. Through Nvidia's CUDA software interface, the device can be treated as an array of Single Instruction Multiple Data (SIMD) processors. The massively multithreaded architecture of a CUDA device lets many threads run in parallel, making optimal use of the GPU's available computation power. For graph algorithms, the vertices of the graph are processed in parallel by mapping them to threads on the device. With thousands of threads running in parallel, the computation time of these algorithms drops drastically compared to their CPU implementations.
We study different parallel algorithms for breadth-first search and all-pairs shortest path carried out on the GPU using CUDA, and compare them with respect to execution time, data structures used, input data, etc. The paper presents an overview of the various parallel methods for BFS and APSP that different authors have implemented on the GPU's multithreaded architecture.
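The vertex-to-thread mapping that this survey (and most of the papers below) relies on can be sketched in a few lines of CUDA. The kernel below is my own minimal illustration, not code from the paper: it assumes a CSR adjacency layout (row_offsets/col_indices) and runs one level-synchronous BFS step with one thread per vertex; the host relaunches it until no level changes.

#include <cuda_runtime.h>

// One BFS level step: each thread owns one vertex of the current frontier
// and labels its unvisited neighbours with the next level.
__global__ void bfs_level_step(const int *row_offsets, const int *col_indices,
                               int *levels, int num_vertices,
                               int current_level, int *changed)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per vertex
    if (v >= num_vertices || levels[v] != current_level) return;

    for (int e = row_offsets[v]; e < row_offsets[v + 1]; ++e) {
        int nbr = col_indices[e];
        if (levels[nbr] == -1) {              // unvisited
            levels[nbr] = current_level + 1;  // benign race: all writers store the same value
            *changed = 1;                     // ask the host for another iteration
        }
    }
}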


http://ppl.stanford.edu/papers/ppopp070a-hong.pdf
Accelerating CUDA Graph Algorithms at Maximum Warp, 2011
Graphs are powerful data representations favored in many computational domains. Modern GPUs have recently shown promising results in accelerating computationally challenging graph problems but their performance suffers heavily when the graph structure is highly irregular, as most real-world graphs tend to be. In this study, we first observe that the poor performance is caused by work imbalance and is an artifact of a discrepancy between the GPU programming model and the underlying GPU architecture. We then propose a novel virtual warp-centric programming method that exposes the traits of underlying GPU architectures to users. Our method significantly improves the performance of applications with heavily imbalanced workloads, and enables trade-offs between workload imbalance and ALU underutilization for fine-tuning the performance.
Our evaluation reveals that our method exhibits up to 9x speedup over previous GPU algorithms and 12x over single thread CPU execution on irregular graphs. When properly configured, it also yields up to 30% improvement over previous GPU algorithms on regular graphs. In addition to performance gains on graph algorithms, our programming method achieves 1.3x to 15.1x speedup on a set of GPU benchmark applications. Our study also confirms that the performance gap between GPUs and other multi-threaded CPU graph implementations is primarily due to the large difference in memory bandwidth.
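To make the virtual-warp idea concrete, here is a rough sketch of how a group of threads can cooperatively expand one vertex's adjacency list, so that a single high-degree vertex no longer stalls one thread. This is only my illustration of the general concept under an assumed CSR layout; VWARP_SIZE and all other names are mine, not the paper's, and the group size is exactly the knob that trades workload balance against ALU utilization.

#define VWARP_SIZE 8   // virtual warp size: tune per graph

// Same level-synchronous BFS step as above, but VWARP_SIZE threads share a
// vertex and stride over its edge list together.
__global__ void bfs_vwarp_step(const int *row_offsets, const int *col_indices,
                               int *levels, int num_vertices,
                               int current_level, int *changed)
{
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int v    = tid / VWARP_SIZE;          // vertex owned by this virtual warp
    int lane = tid % VWARP_SIZE;          // position inside the virtual warp
    if (v >= num_vertices || levels[v] != current_level) return;

    for (int e = row_offsets[v] + lane; e < row_offsets[v + 1]; e += VWARP_SIZE) {
        int nbr = col_indices[e];
        if (levels[nbr] == -1) {
            levels[nbr] = current_level + 1;
            *changed = 1;
        }
    }
}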


http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F6067545%2F6075092%2F06075214.pdf%3Farnumber%3D6075214&authDecision=-203
A modified parallel approach to Single Source Shortest Path Problem for massively dense graphs using CUDA, 2011
Today's Graphics Processing Units (GPUs) possess enormous computation power with a highly parallel and multithreaded architecture, and their most attractive feature is their low cost. NVIDIA's CUDA provides an interface for developers to use GPUs for general-purpose parallel computing. In this paper, we present a modified algorithm for the Single Source Shortest Path Problem on the GPU using CUDA. First, we modify the standard Bellman-Ford algorithm to remove its drawbacks and make it suitable for parallel implementation, and then implement it using CUDA. For dense graphs, our algorithm gains a speedup of 10×–12× over the previously proposed parallel algorithm. Our algorithm also accepts graphs with negative edge weights and can detect any reachable negative-weight cycle, which widens its scope to accept generalized problems.
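A parallel Bellman-Ford pass is easy to express on the GPU; the kernel below is a hedged sketch of the general idea (one thread per edge, atomicMin for concurrent relaxation), not the authors' modified algorithm. The edge-array layout, integer weights, and all names are my own assumptions.

#include <climits>

// One relaxation pass over an edge list stored as (src, dst, weight) arrays.
// Distances are ints initialised to INT_MAX (0 at the source) so that atomicMin
// can resolve concurrent updates to the same destination.
__global__ void relax_edges(const int *src, const int *dst, const int *weight,
                            int *dist, int num_edges, int *changed)
{
    int e = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per edge
    if (e >= num_edges) return;

    int du = dist[src[e]];
    if (du == INT_MAX) return;                       // tail not reached yet
    int cand = du + weight[e];
    if (cand < dist[dst[e]]) {
        atomicMin(&dist[dst[e]], cand);
        *changed = 1;
    }
}

// Host side: repeat the pass until *changed stays 0. If an edge can still be
// relaxed after |V|-1 passes, a reachable negative-weight cycle exists.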


http://www.springerlink.com/content/a254839r565r176p/
CUDA Solutions for the SSSP Problem, 2009
We present several algorithms that solve the single-source shortest-path problem using CUDA. We have run them on a database composed of hundreds of large graphs represented by adjacency lists and adjacency matrices, achieving high speedups with respect to a CPU implementation based on Fibonacci heaps. Concerning correctness, we outline why our solutions work, and show that a previous approach [10] is incorrect.


http://ldc.usb.ve/~vtheok/cursos/ci6323/pdf/lecturas/Accelerating%20Large%20Graph%20Algorithms%20on%20the%20GPU%20Using%20CUDA.pdf
Accelerating Large Graph Algorithms on the GPU Using CUDA, Pawan Harish and P.J. Narayanan, 2007

Large graphs involving millions of vertices are common in many practical applications and are challenging to process. Practical-time implementations using high-end computers are reported but are accessible only to a few. Graphics Processing Units (GPUs) of today have high computation power and low price. They have a restrictive programming model and are tricky to use. The G80 line of Nvidia GPUs can be treated as a SIMD processor array using the CUDA programming model. We present a few fundamental algorithms – including breadth first search, single source shortest path, and all-pairs shortest path – using CUDA on large graphs. We can compute the single source shortest path on a 10 million vertex graph in 1.5 seconds using the Nvidia 8800GTX GPU costing $600. In some cases the optimal sequential algorithm is not the fastest on the GPU architecture. GPUs have great potential as high-performance co-processors.
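All of the iterative kernels in these papers (BFS level steps, Bellman-Ford passes, frontier updates) share the same host-side pattern: launch a pass, read back a flag, and stop when nothing changed. A minimal sketch, assuming the relax_edges kernel and the device pointers from the earlier example:

int h_changed = 1;
while (h_changed) {
    h_changed = 0;
    cudaMemcpy(d_changed, &h_changed, sizeof(int), cudaMemcpyHostToDevice);
    relax_edges<<<(num_edges + 255) / 256, 256>>>(d_src, d_dst, d_weight,
                                                  d_dist, num_edges, d_changed);
    cudaMemcpy(&h_changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
}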


----

http://sc.chat-shuffle.net/paper/uid:110006828689
Accelerating the All-Pairs Shortest Path Problem with CUDA (Application Acceleration; Hokkaido Workshop on Evaluation of High-Performance Computing and Architecture (HOKKE-2008))
This paper describes results of accelerating the all-pairs shortest path (APSP) problem using a GPU (Graphics Processing Unit). In the proposed method, the GPU program is written with CUDA (Compute Unified Device Architecture), the development environment for GPUs. The algorithm repeatedly solves single-source shortest path problems (the iterative SSSP method). To reduce sequential processing over the whole problem and obtain higher speedup, each SSSP problem is treated as one task, and these tasks are processed in parallel. Furthermore, shared memory is used to share data between tasks and reduce global memory accesses. As a result, the method is 3.5 to 18 times faster than existing methods. The paper also shows that the performance of the iterative SSSP method varies with the characteristics of the graph.
http://hagi-www.ics.es.osaka-u.ac.jp/research/papers/200812_t-okuyam_ispa.pdf
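The "one SSSP instance per task" idea can be sketched as one thread block per source vertex, each running its own Bellman-Ford-style sweep over the edge list into its row of the distance matrix. This is only an illustration of the task decomposition under assumed names and an edge-list layout, not the paper's implementation (which also shares data between tasks through shared memory).

#include <climits>

// APSP by repeated SSSP: block b computes the distances from source vertex b.
// dist is a num_vertices x num_vertices row-major matrix, pre-initialised on the
// host to INT_MAX everywhere except dist[s * num_vertices + s] = 0.
__global__ void apsp_sssp_tasks(const int *src, const int *dst, const int *weight,
                                int *dist, int num_vertices, int num_edges)
{
    int source = blockIdx.x;                     // one block = one SSSP task
    int *d = dist + source * num_vertices;       // this task's distance row

    // |V|-1 relaxation passes guarantee convergence (no negative cycles assumed);
    // a shared-memory "changed" flag could be added for early exit.
    for (int pass = 0; pass < num_vertices - 1; ++pass) {
        for (int e = threadIdx.x; e < num_edges; e += blockDim.x) {
            int du = d[src[e]];
            if (du == INT_MAX) continue;
            int cand = du + weight[e];
            if (cand < d[dst[e]])
                atomicMin(&d[dst[e]], cand);
        }
        __syncthreads();
    }
}

// Launch: apsp_sssp_tasks<<<num_vertices, 256>>>(d_src, d_dst, d_weight,
//                                                d_dist, num_vertices, num_edges);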
