ラベル survey の投稿を表示しています。 すべての投稿を表示
ラベル survey の投稿を表示しています。 すべての投稿を表示

2012年2月11日土曜日

2011年12月13日火曜日

RAMGRAPH: large scale graph processing in RAMCloud

http://code.google.com/p/ramgraph/

Recently, the pay-as-you-go computing paradigm in public clouds such as Amazon EC2 offers users to execute their computational tasks on the rented virtualized computation resources. Due to the lower cost than hosting a private cloud, this paradigm has become popular for medium- and small-sized Web companies such as Alexa and SmugMug. Mining and processing the large web and social networks is one of the common regular tasks for those Web companies. For example, a search engine typically uses a graph-based ranking scheme such as PageRank to give an order to the pages. Moreover, those mining and processing tasks are highly customized with various user-defined logics applied on the entire graph. The requirement of high efficiency in large scale graph processing challenges existing tools. This is because, most of current tools such as MapReduce and Hadoop are disk-based, where the hard disk is typically 100-1000x slower than the main memory. The random access nature of graph processing further harnesses the performance of disk-based graph processing. The low efficiency of current disk-based systems limits the popularity of graph mining and business intelligence in Web companies, and can result in low utilization of existing investment and lose potential business opportunities.

To unleash the computation power of current cloud offerings, we propose Thunder, a large graph processing system in the main memories of hundreds or thousands of machines. Thunder stores and processes graphs entirely in the main memory, and uses hard disks only for backup and archrivals. The system provides APIs similar to Map and Reduce functions in MapReduce for users to implement their user-defined logics. These logics are automatically executed on the graph in a distributed manner.

The goal of designing and implementing Thunder is to exploit the advantage of in-memory processing, while remaining all the merits of conventional disk-based tools, namely excellent scalability, good fault-tolerance and ability of expressing arbitrary and complex customized logic. In particular, we are facing challenging issues like scalability, availability, complex memory management, network traffic reducing and so on.

Tools for Large Graph Mining

http://www.cs.cmu.edu/~deepay/thesis.pdf

Graphs show up in a surprisingly diverse set of disciplines, ranging from computer networks to sociology, biology, ecology and many more. How do such “normal” graphs look like? How can we spot abnormal subgraphs within them? Which
nodes/edges are “suspicious?” How does a virus spread over a graph? Answering
these questions is vital for outlier detection (such as terrorist cells, money laundering rings), forecasting, simulations (how well will a new protocol work on a realistic
computer network?), immunization campaigns and many other applications.
We attempt to answer these questions in two parts. First, we answer questions
targeted at applications: what patterns/properties of a graph are important for solving
specific problems? Here, we investigate the propagation behavior of a computer virus
over a network, and find a simple formula for the epidemic threshold (beyond which
any viral outbreak might become an epidemic). We find an “information survival
threshold” which determines whether, in a sensor or P2P network with failing nodes
and links, a piece of information will survive or not. We also develop a scalable,
parameter-free method for finding groups of “similar” nodes in a graph, corresponding
to homogeneous regions (or CrossAssociations) in the binary adjacency matrix of the
graph. This can help navigate the structure of the graph, and find un-obvious patterns.
In the second part of our work, we investigate recurring patterns in real-world
graphs, to gain a deeper understanding of their structure. This leads to the development
of the R-MAT model of graph generation for creating synthetic but “realistic” graphs,
which match many of the patterns found in real-world graphs, including power-law
and lognormal degree distributions, small diameter and “community” effects.

Temporal analysis of semantic graphs using ASALSAN ∗

http://csmr.ca.sandia.gov/~tgkolda/pubs/bibtgkfiles/ICDM07.pdf

ASALSAN is a new algorithm for computing three-way
DEDICOM, which is a linear algebra model for analyzing
intrinsically asymmetric relationships, such as trade among
nations or the exchange of emails among individuals, that
incorporates a third mode of the data, such as time. ASALSAN is unique because it enables computing the three-way
DEDICOM model on large, sparse data. A nonnegative version of ASALSAN is described as well. When we apply these
techniques to adjacency arrays arising from directed graphs
with edges labeled by time, we obtain a smaller graph on latent semantic dimensions and gain additional information
about their changing relationships over time. We demonstrate these techniques on international trade data and the
Enron email corpus to uncover latent components and their
transient behavior. The mixture of roles assigned to individuals by ASALSAN showed strong correspondence with
known job classifications and revealed the patterns of communication between these roles. Changes in the communication pattern over time, e.g., between top executives and
the legal department, were also apparent in the solutions.

Survey (2)

Closeness Centralityの高いノードを発見する高速アルゴリズム
http://www.ieice.org/ken/paper/20111021b0j3/

グラフマイニングとその統計的モデリングへの応用

グラフマイニングとその統計的モデリングへの応用

http://www.ism.ac.jp/editsec/toukei/pdf/54-2-315.pdf

逆探索法によるグラフ系列マイニングの高速化
http://db-event.jpn.org/deim2011/proceedings/pdf/b10-2.pdf


グラフ系列マイニング
http://latent-dynamics.net/02/13_Inokuchi.ppt.pdf

頂点により誘導される頻出グラフ系列パターンのマイニング
http://www-erato.ist.hokudai.ac.jp/html/php/seminar5_docs/inokuchi.pdf

多頻度グラフマイニングを利用した動画の解析
http://www.ieice.org/ken/paper/20081218kahU/
多頻度グラフマイニングとは,大量のグラフ構造から,頻出するグラフパターンを意味のあるものとして抽出する手法である.本研究では,多頻度グラフマイニングを動画像の解析に用いる.本稿では,監視カメラの前を移動物体が通過するような動画を対象に多頻度グラフマイニングを用いてグラフベースの背景除去手法を提案する.具体的には,各ビデオフレームを領域分割し,各領域を節点,各領域間の隣接関係を辺としてグラフ構造を作る.この時,背景部分が頻出パターンとなることを利用して背景除去を行
う.

非整列RNA配列群からの頻出パターンのマイニング
http://www.mizuho-ir.co.jp/publication/giho/pdf/001_10.pdf

ページ閲覧時間を考慮した Web ログマイニング手法の提案
http://www.aiet.ecei.tohoku.ac.jp/~mihara/reference/refICS148.pdf
近年,Web サイトの複雑化・大規模化に伴い,Web サイト管理者がユーザのニーズ
に合わせて Web サイトを適宜改善していく必要が高まっている.このユーザのニーズを
知る手がかりの一つとして有用なのが Web アクセスログである.Web アクセスログを解
析することで,ユーザの Web サイト内での行動の様子を知ることができる.本稿では,
Web アクセスログ解析にデータマイニングを応用する Web ログマイニングに関連して,
ユーザが各ページを閲覧した時間を考慮したアクセスパターンを抽出する手法を提案す
る.

グラフ分析を利用した文書集合からの話題構造マイニング
http://sc.chat-shuffle.net/paper/uid:110007380661

グラフマイニングを応用した系列データ解析
http://www.ai-gakkai.or.jp/jsai/conf/2008/program/pdf/100318.pdf

メタデータ付き推薦のためのグラフマイニング
https://kaigi.org/jsai/webprogram/2011/pdf/231.pdf


時系列データを用いたWebグラフマイニング
http://db-event.jpn.org/deim2009/proceedings/files/B9-2.pdf

背景の分割に対応したグラフマイニングベースの動画像からの背景除去
Graph Mining based Background Removal from Videos Solving Background Separation

グラフマイニングアルゴリズムを用いたギャップを含むコードクローン情報の生成
http://iss.ndl.go.jp/books/R000000004-I10809124-00
http://ci.nii.ac.jp/naid/110008584014


グラフマイニングを用いたブロードバンドサービスエリア分類手法の一考察
https://kaigi.org/jsai/webprogram/2010/paper-422.html

多段GBIツールを利用した化学物質の部分構造情報の有効利用
http://www.sccj.net/event/nenkai/2005au/program/abstract-pdf/2O-02.pdf

線形グラフのマイニングアルゴリズム
http://www-erato.ist.hokudai.ac.jp/html/php/symposium3_docs/tabei-preview-20101129.pdf

2009年5月18日月曜日

GPU 上の MapReduce

PACT 2008 (Parallel Architectures and Compilation Techniques) で発表された GPU 上で MapReduce のランタイムを実装する研究。

「Mars: A MapReduce Framework on Graphics Processors」
http://www.cse.ust.hk/catalac/papers/mars_pact08.pdf

スタンフォード大学の Christos 等は Cell や Multi-Core で MapReduce を作る研究をやっているので、最近はやりのプロセッサでは一通りやられた感がある。

だが、異機種混在環境 (Cell + GPU など) などでの MapReduce はまだまだ着手されていないだろう。