Tuesday, December 13, 2011

RAMGRAPH: large scale graph processing in RAMCloud

http://code.google.com/p/ramgraph/

The pay-as-you-go computing paradigm of public clouds such as Amazon EC2 lets users run their computational tasks on rented, virtualized resources. Because this costs less than hosting a private cloud, the paradigm has become popular among small and medium-sized Web companies such as Alexa and SmugMug. Mining and processing large web graphs and social networks is a routine task for these companies. For example, a search engine typically uses a graph-based ranking scheme such as PageRank to order pages. Moreover, these mining and processing tasks are highly customized, with various user-defined logic applied to the entire graph. The need for high efficiency in large-scale graph processing challenges existing tools, because most current tools, such as MapReduce and Hadoop, are disk-based, and hard disks are typically 100-1000x slower than main memory. The random-access nature of graph processing further hurts the performance of disk-based systems. This low efficiency limits the adoption of graph mining and business intelligence in Web companies, and can lead to poor utilization of existing investments and lost business opportunities.

To unleash the computing power of current cloud offerings, we propose Thunder, a large-graph processing system that runs in the main memory of hundreds or thousands of machines. Thunder stores and processes graphs entirely in main memory and uses hard disks only for backup and archival. The system provides APIs similar to the Map and Reduce functions in MapReduce, with which users implement their own logic. This logic is then executed automatically on the graph in a distributed manner.
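To make the Map/Reduce-style programming model more concrete, here is a minimal single-machine sketch in Python of how user-defined logic such as PageRank might be expressed over an in-memory graph. It is not Thunder's actual API: the names `run_iteration`, `map_fn`, and `reduce_fn` are hypothetical, the graph is a plain adjacency-list dictionary, nothing is distributed, and every vertex is assumed to have at least one outgoing edge.

```python
from collections import defaultdict


def run_iteration(graph, values, map_fn, reduce_fn):
    """One map/reduce pass over an in-memory adjacency-list graph (hypothetical API)."""
    messages = defaultdict(list)
    # Map phase: each vertex emits (target, message) pairs along its out-edges.
    for vertex, neighbors in graph.items():
        for target, message in map_fn(vertex, values[vertex], neighbors):
            messages[target].append(message)
    # Reduce phase: each vertex combines the messages that were sent to it.
    return {vertex: reduce_fn(vertex, messages[vertex]) for vertex in graph}


def pagerank(graph, iterations=20, damping=0.85):
    """PageRank written as user-defined map and reduce functions.

    Assumes every vertex has at least one out-edge and that every
    edge target is also a key of `graph`.
    """
    n = len(graph)
    ranks = {vertex: 1.0 / n for vertex in graph}

    def map_fn(vertex, rank, neighbors):
        # Split the vertex's current rank evenly among its out-neighbors.
        share = rank / len(neighbors)
        for target in neighbors:
            yield target, share

    def reduce_fn(vertex, contributions):
        # Standard damped PageRank update from the incoming contributions.
        return (1.0 - damping) / n + damping * sum(contributions)

    for _ in range(iterations):
        ranks = run_iteration(graph, ranks, map_fn, reduce_fn)
    return ranks


# Tiny example graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for vertex in sorted(ranks):
    print(vertex, round(ranks[vertex], 4))
```

In a real in-memory system the adjacency lists and rank values would be partitioned across the memories of many machines, and the messages produced in the map phase would travel over the network, which is where concerns such as network traffic come in.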

The goal in designing and implementing Thunder is to exploit the advantages of in-memory processing while retaining the merits of conventional disk-based tools, namely excellent scalability, good fault tolerance, and the ability to express arbitrary, complex custom logic. In doing so, we face challenging issues such as scalability, availability, complex memory management, and reducing network traffic.
