Thursday, November 25, 2010

VLDB 2010

http://www.vldb2010.org/proceedings/files/a_research.xml

p.472: The Performance of MapReduce: An In-depth Study
Dawei Jiang (National University of Singapore, Republic of Singapore), Beng Chin Ooi (National University of Singapore, Republic of Singapore), Lei Shi (National University of Singapore, Republic of Singapore), Sai Wu (National University of Singapore, Republic of Singapore)

p.129: Energy Management for MapReduce Clusters
Willis Lang (University of Wisconsin-Madison, United States of America), Jignesh Patel (University of Wisconsin-Madison, United States of America)

p.81: MapMerge: Correlating Independent Schema Mappings
Bogdan Alexe (University of California Santa Cruz, United States of America), Mauricio Hernández (IBM Research, United States of America), Lucian Popa (IBM Almaden Research Center, United States of America), Wang-Chiew Tan (University of California Santa Cruz, United States of America)

p.449: iGraph: A Framework for Comparisons of Disk-Based Graph Indexing Techniques
Wook-Shin Han (Kyungpook National University, Republic of Korea), Jinsoo Lee (Kyungpook National University, Republic of Korea), Minh-Duc Pham (Kyungpook National University, Republic of Korea), Jeffrey Yu (The Chinese University of Hong Kong, People’s Republic of China)

p.340: On Graph Query Optimization in Large Networks
Peixiang Zhao (University of Illinois at Urbana-Champaign, United States of America), Jiawei Han (University of Illinois at Urbana-Champaign, United States of America)


p.330: Dremel: Interactive Analysis of Web-Scale Datasets
Sergey Melnik (Google, United States of America), Andrey Gubarev (Google, United States of America), Jing Jing Long (Google, United States of America), Geoffrey Romer (Google, United States of America), Shiva Shivakumar (Google, United States of America), Matt Tolton (Google, United States of America), Theo Vassilakis (Google, United States of America)

p.285: HaLoop: Efficient Iterative Data Processing on Large Clusters
Yingyi Bu (University of Washington, United States of America), Bill Howe (University of Washington, United States of America), Magdalena Balazinska (University of Washington, United States of America), Michael Ernst (University of Washington, United States of America)

p.264: Graph Pattern Matching: From Intractable to Polynomial Time
Wenfei Fan (University of Edinburgh, United Kingdom), Jianzhong Li (Harbin Institute of Technology, People’s Republic of China), Shuai Ma (University of Edinburgh, United Kingdom), Nan Tang (University of Edinburgh, United Kingdom), Yinghui Wu (University of Edinburgh, United Kingdom), Yunpeng Wu (National University of Defense Technology, People’s Republic of China)

p.276: GRAIL: Scalable Reachability Index for Large Graphs
Hilmi Yildirim (Rensselaer Polytechnic Institute, United States of America), Vineet Chaoji (Yahoo! Research Labs, United States of America), Mohammed Zaki (Rensselaer Polytechnic Institute, United States of America)

p.220: High-Performance Dynamic Pattern Matching over Disordered Streams
Badrish Chandramouli (Microsoft Research, United States of America), Jonathan Goldstein (Microsoft Research, United States of America), David Maier (Portland State University, United States of America)

p.232: SECRET: A Model for Analysis of the Execution Semantics of Stream Processing Systems
Irina Botan (Eidgenössische Technische Hochschule Zürich, Switzerland), Roozbeh Derakhshan (Eidgenössische Technische Hochschule Zürich, Switzerland), Nihal Dindar (Eidgenössische Technische Hochschule Zürich, Switzerland), Laura Haas (IBM Almaden Research Center, United States of America), Renée Miller (University of Toronto, Canada), Nesime Tatbul (Eidgenössische Technische Hochschule Zürich, Switzerland)

p.244: Recognizing Patterns in Streams with Imprecise Timestamps
Haopeng Zhang (University of Massachusetts Amherst, United States of America), Yanlei Diao (University of Massachusetts Amherst, United States of America), Neil Immerman (University of Massachusetts Amherst, United States of America)


Amazon Web Services Article

An article on Amazon Web Services by Tamagawa-san.

Tuesday, November 16, 2010

Project Organization Summary

Let me summarize the lab's current project organization once more.

1. StreamGPU team (松浦, 上野)
2. StreamGraph team (西井, 雁瀬)
3. StreamCloud & StreamDS team (石井, 松浦)
4. StreamScale team (Miyuru, 上野)

Zero-Copy Communication on GPUs

Yesterday's discussion reminded me of this: a mechanism that copies data from kernel-space socket buffers directly into GPU memory, bypassing user space, would be effective for applying GPUs to data stream processing, so it is worth pursuing. The original discussion was about a year ago, but it deserves a second look.

[StreamGraph] Target Applications

To make clear which of these are suited to stream processing, let's list the target applications and classify them (a small sketch of the triangle case follows the list). Graph team members, please add to this list.
  • Graph centrality (HITS, PageRank, Markov centrality)
  • Clustering (betweenness algorithm, k-means, ...), semi-clustering
  • Triangle extraction
  • Maximum matching in bipartite graphs
  • Shortest path problems
  • Graph diameter
  • ...
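
To make the streaming flavor concrete, here is a minimal sketch of the triangle case in plain Java (class name is ours): exact incremental triangle counting over an edge stream, where each arriving edge (u, v) closes one triangle per common neighbor of u and v.

import java.util.*;

/** Minimal sketch: exact triangle counting over a stream of undirected edges.
 *  Assumes each edge arrives at most once and there are no self-loops. */
public class StreamingTriangleCounter {
    private final Map<Integer, Set<Integer>> adj = new HashMap<Integer, Set<Integer>>();
    private long triangles = 0;

    public void addEdge(int u, int v) {
        Set<Integer> nu = neighbors(u), nv = neighbors(v);
        // Each common neighbor of u and v closes one new triangle.
        Set<Integer> small = nu.size() < nv.size() ? nu : nv;
        Set<Integer> large = (small == nu) ? nv : nu;
        for (int w : small) if (large.contains(w)) triangles++;
        nu.add(v);
        nv.add(u);
    }

    private Set<Integer> neighbors(int x) {
        Set<Integer> s = adj.get(x);
        if (s == null) { s = new HashSet<Integer>(); adj.put(x, s); }
        return s;
    }

    public long triangleCount() { return triangles; }
}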

[StreamGraph] Jung

JUNG (Java Universal Network/Graph Framework)
http://jung.sourceforge.net/
An open-source Java library for analyzing and visualizing network (graph) structures.
It implements many algorithms from graph theory, data mining, and social network analysis.
BSD license.
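
For a quick feel of the library, a minimal PageRank example written against the JUNG 2.x API (the 0.15 argument is the random-jump probability; package paths may differ slightly between versions):

import edu.uci.ics.jung.algorithms.scoring.PageRank;
import edu.uci.ics.jung.graph.DirectedSparseGraph;

/** Tiny JUNG example: a three-node cycle, then PageRank scores. */
public class JungPageRankDemo {
    public static void main(String[] args) {
        DirectedSparseGraph<String, Integer> g = new DirectedSparseGraph<String, Integer>();
        g.addVertex("a"); g.addVertex("b"); g.addVertex("c");
        g.addEdge(1, "a", "b");  // edge objects are just unique IDs here
        g.addEdge(2, "b", "c");
        g.addEdge(3, "c", "a");
        PageRank<String, Integer> pr = new PageRank<String, Integer>(g, 0.15);
        pr.evaluate();
        for (String v : g.getVertices())
            System.out.println(v + " -> " + pr.getVertexScore(v));
    }
}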

Monday, November 15, 2010

Domestic Conferences (Japan)


[Refereed]
Forum on Web and Databases (WebDB Forum): submission deadline every August, held in November

[Not refereed]
IEICE Special Interest Group on Web Intelligence and Interaction
Next meeting: March 2011

Monday, November 8, 2010

[StreamGraph] Related Papers

Internet: Diameter of the World-Wide Web, Réka Albert, Hawoong Jeong, Albert-László Barabási, Nature, 1999

Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations, Jure Leskovec, Jon Kleinberg, Christos Faloutsos, KDD 2005

Scale-free characteristics of random networks: The topology of the world-wide web, Albert-László Barabási, Réka Albert, Hawoong Jeong, 2000

[StreamGraph] Random Walk with Restart

H. Tong, C. Faloutsos, and J.-Y. Pan. Fast Random Walk with Restart and Its Applications. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2006

A Unified Framework for Link Recommendation Using Random Walks
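
For orientation, a minimal sketch of the basic RWR computation by power iteration (plain Java, dense matrix for clarity; the point of the Tong et al. paper is precisely how to avoid this naive iteration on large graphs):

/** Minimal sketch of Random Walk with Restart by power iteration.
 *  w[i][j] = probability of stepping to i from j (columns sum to 1),
 *  q = query node, beta = restart probability. */
public class RandomWalkWithRestart {
    public static double[] rwr(double[][] w, int q, double beta, int iters) {
        int n = w.length;
        double[] r = new double[n];
        r[q] = 1.0;  // all probability mass starts at the query node
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    next[i] += (1.0 - beta) * w[i][j] * r[j];
            next[q] += beta;  // restart: mass teleports back to q
            r = next;
        }
        return r;  // r[i] = steady-state relevance of node i w.r.t. q
    }
}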


[StreamGraph] Triangle Law

Evolution of the social network of scientific collaborations, Albert-László Barabási et al., 2001

The co-authorship network of scientists represents a prototype of complex evolving networks. In addition, it offers one of the most extensive databases to date on social networks. By mapping the electronic database containing all relevant journals in mathematics and neuroscience for an eight-year period (1991-98), we infer the dynamic and the structural mechanisms that govern the evolution and topology of this complex system. Three complementary approaches allow us to obtain a detailed characterization. First, empirical measurements allow us to uncover the topological measures that characterize the network at a given moment, as well as the time evolution of these quantities. The results indicate that the network is scale-free, and that the network evolution is governed by preferential attachment, affecting both internal and external links. However, in contrast with most model predictions the average degree increases in time, and the node separation decreases. Second, we propose a simple model that captures the network's time evolution. In some limits the model can be solved analytically, predicting a two-regime scaling in agreement with the measurements. Third, numerical simulations are used to uncover the behavior of quantities that could not be predicted analytically. The combined numerical and analytical results underline the important role internal links play in determining the observed scaling behavior and network topology. The results and methodologies developed in the context of the co-authorship network could be useful for a systematic study of other complex evolving networks as well, such as the world wide web, Internet, or other social networks.
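
The preferential-attachment mechanism referred to above is simple to state in code; a minimal sketch of a standard Barabási-Albert style generator (plain Java, illustrative only): each new node attaches to m existing nodes chosen with probability proportional to their current degree.

import java.util.*;

/** Minimal preferential-attachment generator sketch. Sampling uniformly from
 *  a list that repeats each node once per incident edge is equivalent to
 *  sampling nodes in proportion to their degree. */
public class PreferentialAttachment {
    public static List<int[]> generate(int nodes, int m, Random rnd) {
        List<int[]> edges = new ArrayList<int[]>();
        List<Integer> endpoints = new ArrayList<Integer>();
        for (int v = 0; v < m; v++) endpoints.add(v);  // seed nodes
        for (int v = m; v < nodes; v++) {
            Set<Integer> targets = new HashSet<Integer>();
            while (targets.size() < m)  // m distinct, degree-biased targets
                targets.add(endpoints.get(rnd.nextInt(endpoints.size())));
            for (int t : targets) {
                edges.add(new int[]{v, t});
                endpoints.add(v);
                endpoints.add(t);
            }
        }
        return edges;
    }
}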

Structure of a large social network, Barabási et al., 2003

We study a social network consisting of over 10^4 individuals, with a degree distribution exhibiting two power scaling regimes separated by a critical degree k_crit, and a power law relation between degree and local clustering. We introduce a growing random model based on a local interaction mechanism that reproduces all of the observed scaling features and their exponents. Our results lend strong support to the idea that several very different networks are simultaneously present in the human social network, and these need to be taken into account for successful modeling.

Sunday, November 7, 2010

[StreamGraph] Fast Counting of Triangles in Large Real Networks: Algorithms and Laws

http://www.cs.cmu.edu/~ctsourak/tsourICDM08.pdf

Fast Counting of Triangles in Large Real Networks: Algorithms and Laws

How can we quickly find the number of triangles in a large graph, without actually counting them? Triangles are important for real-world social networks, lying at the heart of the clustering coefficient and of the transitivity ratio. However, straightforward and even approximate counting algorithms can be slow, trying to execute or approximate the equivalent of a 3-way database join. In this paper, we provide two algorithms: EigenTriangle for counting the total number of triangles in a graph, and the EigenTriangleLocal algorithm that gives the count of triangles that contain a desired node. Additional contributions include the following: (a) we show that both algorithms achieve excellent accuracy, with up to 1000x faster execution time, on several real graphs, and (b) we discover two new power laws (the Degree-Triangle and Triangle-Participation laws) with surprising properties.
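
The spectral identity behind both algorithms: with adjacency matrix A, eigenvalues \lambda_i and eigenvectors u_i, the diagonal of A^3 counts closed walks of length 3, and the trace counts each triangle six times (twice per incident node). Hence

\Delta(G) = \frac{1}{6} \sum_{i=1}^{n} \lambda_i^3,
\qquad
\Delta(v) = \frac{1}{2} \sum_{i=1}^{n} \lambda_i^3 \, u_{v,i}^2

where \Delta(G) is the total triangle count and \Delta(v) the number of triangles containing node v. The speedup comes from truncating these sums to the top few eigenvalues, which dominate in real graphs with skewed spectra.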

EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs

PDF: http://www.cs.cmu.edu/~badityap/papers/eigenspokes-pakdd10.pdf

Detecting patterns in social graphs built from call records.

EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs

We report a surprising, persistent pattern in large sparse social graphs, which we term EigenSpokes. We focus on large mobile call graphs, spanning about 186K nodes and millions of calls, and find that the singular vectors of these graphs exhibit a striking EigenSpokes pattern wherein, when plotted against each other, they have clear, separate lines that often neatly align along specific axes (hence the term "spokes"). Furthermore, analysis of several other real-world datasets, e.g., patent citations, Internet, etc., reveals similar phenomena, indicating this to be a more fundamental attribute of large sparse graphs that is related to their community structure. This is the first contribution of this paper. Additional ones include (a) a study of the conditions that lead to such EigenSpokes, and (b) a fast algorithm for spotting and extracting tightly-knit communities, called SpokEn, that exploits our findings about the EigenSpokes pattern.

Graph Algorithms on MapReduce

Design Patterns for Efficient Graph Algorithms in MapReduce

http://www.umiacs.umd.edu/~jimmylin/publications/Lin_Schatz_MLG2010.pdf

Jimmy Lin, Michael Schatz, University of Maryland
Graphs are analyzed in many important contexts, including ranking search results based on the hyperlink structure of the world wide web, module detection of protein-protein interaction networks, and privacy analysis of social networks. MapReduce provides an enabling technology for large-scale graph processing. However, there appears to be a paucity of knowledge on designing scalable graph algorithms. Existing best practices for MapReduce graph algorithms have significant shortcomings that limit performance, especially with respect to partitioning, serializing, and distributing the graph. We present three design patterns that address these issues and can be used to accelerate a large class of graph algorithms based on message passing, exemplified by PageRank. Experiments show that the application of our design patterns reduces the running time of PageRank on a web graph with 1.4 billion edges by 69%.
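
The message-passing structure the paper optimizes is easy to sketch; below is one PageRank iteration in plain Java, mirroring the map step (each node emits rank/outdegree along its out-edges) and the reduce step (sum incoming mass, apply damping). This is the baseline pattern, not the paper's optimized design patterns; it assumes every node appears as a key in outLinks.

import java.util.*;

/** One message-passing PageRank iteration, structured like map + reduce. */
public class PageRankStep {
    public static Map<Integer, Double> step(Map<Integer, int[]> outLinks,
                                            Map<Integer, Double> rank,
                                            double d) {  // d = damping factor
        int n = outLinks.size();
        Map<Integer, Double> in = new HashMap<Integer, Double>();
        // "Map" phase: each node sends rank/outdegree to every neighbor.
        for (Map.Entry<Integer, int[]> e : outLinks.entrySet()) {
            double share = rank.get(e.getKey()) / e.getValue().length;
            for (int v : e.getValue()) {
                Double acc = in.get(v);
                in.put(v, (acc == null ? 0.0 : acc) + share);
            }
        }
        // "Reduce" phase: combine incoming mass with the teleport term.
        Map<Integer, Double> next = new HashMap<Integer, Double>();
        for (int v : outLinks.keySet()) {
            Double acc = in.get(v);
            next.put(v, (1 - d) / n + d * (acc == null ? 0.0 : acc));
        }
        return next;
    }
}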


Graph Processing with MapReduce
http://horicky.blogspot.com/2010/07/graph-processing-in-map-reduce.html

Twister: Iterative MapReduce

An iterative MapReduce runtime developed by Geoffrey Fox's group at Indiana University.
Twister http://www.iterativemapreduce.org/

[StreamGraph] Extended Library for Graph Processing

Standard Template Library for XXL Data Sets

http://www.springerlink.com/content/6drvfrkj7uu4hq0a/
We present a software library, STXXL, that enables practice-oriented experimentation with huge data sets. STXXL is an implementation of the C++ standard template library STL for external memory computations. It supports parallel disks, overlapping between I/O and computation, and a pipelining technique that can save more than half of the I/Os. STXXL has already been used for computing minimum spanning trees, connected components, breadth-first search decompositions, constructing suffix arrays, and computing social network analysis metrics.


Building a Parallel Pipelined External Memory Algorithm Library


Large and fast hard disks for little money have enabled the processing of huge amounts of data on a single machine. For this purpose, the well-established STXXL library provides a framework for external memory algorithms with an easy-to-use interface. However, the clock speed of processors cannot keep up with the increasing bandwidth of parallel disks, making many algorithms actually compute-bound. To overcome this steadily worsening limitation, we exploit today's multi-core processors with two new approaches. First, we parallelize the internal computation of the encapsulated external memory algorithms by utilizing the MCSTL library. Second, we augment the unique pipelining feature of the STXXL, to enable automatic task parallelization. We show using synthetic and practical use cases that the combination of both techniques increases performance greatly.

PDF
http://algo2.iti.kit.edu/singler/publications/parallelstxxl-ipdps2009.pdf

Hadoop Summit 2010

http://developer.yahoo.com/events/hadoopsummit2010/agenda.html

Saturday, November 6, 2010

[StreamGraph] Max-Cover in Map-Reduce

Max-Cover in Map-Reduce, WWW 2010

http://research.google.com/pubs/pub36582.html

The NP-hard Max-k-cover problem requires selecting k sets from a collection so as to maximize the size of the union. This classic problem occurs commonly in many settings in web search and advertising. For moderately-sized instances, a greedy algorithm gives an approximation of (1-1/e). However, the greedy algorithm requires updating scores of arbitrary elements after each step, and hence becomes intractable for large datasets. We give the first max cover algorithm designed for today's large-scale commodity clusters. Our algorithm has provably almost the same approximation as greedy, but runs much faster. Furthermore, it can be easily expressed in the MapReduce programming paradigm, and requires only polylogarithmically many passes over the data. Our experiments on five large problem instances show that our algorithm is practical and can achieve good speedups compared to the sequential greedy algorithm.
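
For reference, a minimal sketch of the sequential greedy baseline that the paper parallelizes (plain Java; assumes k does not exceed the number of sets):

import java.util.*;

/** Greedy Max-k-Cover: repeatedly pick the set covering the most
 *  still-uncovered elements; achieves the (1 - 1/e) approximation. */
public class GreedyMaxCover {
    public static List<Integer> pick(List<Set<Integer>> sets, int k) {
        Set<Integer> covered = new HashSet<Integer>();
        List<Integer> chosen = new ArrayList<Integer>();
        for (int step = 0; step < k; step++) {
            int best = -1, bestGain = -1;
            for (int i = 0; i < sets.size(); i++) {
                if (chosen.contains(i)) continue;
                int gain = 0;  // number of new elements set i would add
                for (int e : sets.get(i)) if (!covered.contains(e)) gain++;
                if (gain > bestGain) { bestGain = gain; best = i; }
            }
            chosen.add(best);
            covered.addAll(sets.get(best));
        }
        return chosen;
    }
}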

Apache HAMA

http://incubator.apache.org/hama/

Apache Hama is a distributed computing framework based on BSP (Bulk Synchronous Parallel) computing techniques for massive scientific computations (e.g., matrix, graph, network, etc.), currently being incubated as an incubator project by the Apache Software Foundation.

Paper
http://salsahpc.indiana.edu/CloudCom2010/papers.html
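
A conceptual sketch of the BSP pattern HAMA builds on, in plain Java rather than the HAMA API: workers alternate a local computation phase with a global barrier, and anything sent during superstep t only becomes visible at superstep t+1.

import java.util.concurrent.CyclicBarrier;

/** BSP in miniature: each worker computes locally, then all synchronize. */
public class BspSketch {
    public static void main(String[] args) {
        final int workers = 4, supersteps = 3;
        final CyclicBarrier barrier = new CyclicBarrier(workers);
        for (int w = 0; w < workers; w++) {
            final int id = w;
            new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int s = 0; s < supersteps; s++) {
                            // local computation; outgoing messages would be
                            // buffered here for delivery after the barrier
                            System.out.println("worker " + id + ", superstep " + s);
                            barrier.await();  // global synchronization point
                        }
                    } catch (Exception e) { throw new RuntimeException(e); }
                }
            }).start();
        }
    }
}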

DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining

DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining, 2008, ICDM

PDF http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.163.1677&rep=rep1&type=pdf

Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry.

[StreamGraph] PEGASUS (Peta Graph Mining System)

Project URL http://www.cs.cmu.edu/~pegasus/

Paper URL:
http://www.cs.cmu.edu/~ukang/papers/PegasusICDM2009.pdf

PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations

In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several Giga-, Tera- or Peta-bytes, the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on the top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components etc.) are essentially a repeated matrix-vector multiplication. In this paper we describe a very important primitive for PEGASUS, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available web graphs, thanks to Yahoo!, with 6.7 billion edges.
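
The GIM-V primitive, roughly as the paper defines it, is a matrix-vector product whose three inner steps are replaced by user-defined operators:

v'_i = \operatorname{assign}\Bigl( v_i,\; \operatorname{combineAll}_{j=1..n} \bigl\{ \operatorname{combine2}(m_{i,j},\, v_j) \bigr\} \Bigr)

Ordinary multiplication is the instance combine2(m, v) = m v, combineAll = \sum, assign(v, x) = x; other choices of the three operators yield PageRank, random walk with restart, connected components, and diameter estimation as iterated GIM-V.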

Friday, November 5, 2010

MapReduce Online

http://www.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-136.html


MapReduce is a popular framework for data-intensive distributed computing of batch jobs. To simplify fault tolerance, the output of each MapReduce task and job is materialized to disk before it is consumed. In this paper, we propose a modified MapReduce architecture that allows data to be pipelined between operators. This extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. We present a modified version of the Hadoop MapReduce framework that supports online aggregation, which allows users to see "early returns" from a job as it is being computed. Our Hadoop Online Prototype (HOP) also supports continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing. HOP retains the fault tolerance properties of Hadoop, and can run unmodified user-defined MapReduce programs.

[StreamScale] ZooKeeper: Wait-free coordination for Internet-scale systems

http://research.yahoo.com/pub/3280

ZooKeeper: Wait-free coordination for Internet-scale systems

Authors: Hunt, P.; Konar, M.; Junqueira, F. P.; Reed, B.
Source: USENIX Annual Technical Conference (2010)

Abstract: In this paper, we describe ZooKeeper, a service for coordinating processes of distributed applications. Since ZooKeeper is part of critical infrastructure, ZooKeeper aims to provide a simple and high performance kernel for building more complex coordination primitives at the client. It incorporates elements from group messaging, shared registers, and distributed lock services in a replicated, centralized service. The interface exposed by ZooKeeper has the wait-free aspects of shared registers with an event-driven mechanism similar to cache invalidations of distributed file systems to provide a simple, yet powerful coordination service. The ZooKeeper interface enables a high-performance service implementation. In addition to the wait-free property, ZooKeeper provides a per client guarantee of FIFO execution of requests and linearizability for all requests that change the ZooKeeper state. These design decisions enable the implementation of a high performance processing pipeline with read requests being satisfied by local servers. We show for the target workloads, 2:1 to 100:1 read to write ratio, that ZooKeeper can handle tens to hundreds of thousands of transactions per second. This performance allows ZooKeeper to be used extensively by client applications.
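
A minimal group-membership example against the ZooKeeper Java client API (connection string and the pre-created /workers parent path are placeholders): an ephemeral znode is deleted automatically when its session ends, so the children of /workers always list the live processes.

import org.apache.zookeeper.*;

/** Each process registers itself under /workers with an ephemeral,
 *  sequentially numbered znode; a crash deregisters it automatically. */
public class Membership {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) { /* connection events */ }
        });
        String me = zk.create("/workers/worker-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("registered as " + me);
        System.out.println("live workers: " + zk.getChildren("/workers", false));
    }
}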

[StreamScale] S4: Distributed Stream Computing Platform

http://s4.io/
http://labs.yahoo.com/node/476

S4: Distributed Stream Computing Platform
Authors: Neumeyer L, Robbins B, Nair A, Kesari A

Abstract: S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model [1], providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail, describe various applications, including real-life deployments. Our design is primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.

- A model in which each PE is an actor (agent) that processes events as they arrive.
- A PE developer implements two methods, processEvent and output. processEvent handles each incoming event and passes results to the output stream; output is called from an external component, which makes it possible to periodically process buffered data, as in an aggregation (see the sketch below).
- PEs are loosely coupled: which PE connects to which is described as Spring framework Bean definitions, so PE-to-PE wiring is not written in application code.
- Some comments on the web claim S4 is implemented on top of Hadoop; that is completely wrong. It is not built on Hadoop.
- All PEs are symmetric; there is no master daemon.

- The S4 sample program implements an application that counts topic occurrences using the Twitter Stream API.
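
A minimal sketch of the PE model described above (method names follow the paper's description; the real S4 class hierarchy and configuration differ in the details):

/** A keyed PE instance that counts events for the topic it is bound to. */
public class TopicCountPE {
    private final String topic;  // the key value this PE instance serves
    private long count = 0;

    public TopicCountPE(String topic) { this.topic = topic; }

    // Invoked by the platform for every event routed to this PE's key.
    public void processEvent(String word) {
        count++;
    }

    // Invoked periodically from outside the PE (frequency is configured),
    // e.g. to publish the buffered aggregate downstream.
    public void output() {
        System.out.println(topic + ": " + count);
    }
}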

Tuesday, November 2, 2010

[StreamGraph] Graph Benchmark: Graph 500

A benchmark suite for graph processing. Essential reading for anyone working on graphs.

Web Page: http://www.graph500.org/

Specification
http://www.graph500.org/spec.html

Implementation v1.2
http://www.graph500.org/downloads.html

Data intensive supercomputer applications are increasingly important HPC workloads, but are ill suited for platforms designed for 3D physics simulations. Current benchmarks and performance metrics do not provide useful information on the suitability of supercomputing systems for data intensive applications. A new set of benchmarks is needed in order to guide the design of hardware architectures and software systems intended to support such applications and to help procurements. Graph algorithms are a core part of many analytics workloads. Backed by a steering committee of over 30 international HPC experts from academia, industry, and national laboratories, Graph 500 will establish a set of large-scale benchmarks for these applications. This BOF will unveil the first Graph 500 list, and discuss both today's Graph 500 benchmark and the evolution of that benchmark going forward. The Graph 500 steering committee is in the process of developing comprehensive benchmarks to address three application kernels: concurrent search, optimization (single source shortest path), and edge-oriented (maximal independent set). Further, we are in the process of addressing five graph-related business areas: Cybersecurity, Medical Informatics, Data Enrichment, Social Networks, and Symbolic Networks. The BOF will offer a forum for community and provide a rallying point for data intensive supercomputing problems, and follows the introduction of the Graph 500 at ISC2010. This is the first serious approach to complement the Top 500 with data intensive applications. Additionally, we are working with the SPEC committee to include our benchmark in their CPU benchmark suite. We anticipate the list will rotate between ISC and SC in future years. The Graph 500 was announced at ISC2010 and the first list will appear at SC2010.

Monday, November 1, 2010

[StreamXML] Related Papers

XML query processing using GPU
http://www.kde.cs.tsukuba.ac.jp/~vjordan/docs/

Regular Expression Matching on Graphics Hardware for Intrusion Detection
http://portal.acm.org/citation.cfm?id=1691157

Evaluating GPUs for Network Packet Signature Matching
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.153.4770

iNFAnt: NFA Pattern Matching on GPGPU Devices
http://netgroup.polito.it/Members/niccolo_cascarano/pub/cuda-02-2010.pdf

Research into GPU accelerated pattern matching for applications in computer security
http://www.cosc.canterbury.ac.nz/research/reports/HonsReps/2009/hons_0905.pdf

High-throughput stream categorization and intrusion detection on GPU
http://ieeexplore.ieee.org/Xplore/login.jsp?url=http://ieeexplore.ieee.org/iel5/5550962/5558619/05558648.pdf%3Farnumber%3D5558648&authDecision=-203


Small-ruleset regular expression matching on GPGPUs
http://portal.acm.org/citation.cfm?id=1810130

SQL/XML-IMDBg: A GPU In-Memory Database and Query Co-Processor
http://www.slideshare.net/NVIDIA/1105-gtc09