鈴村グループ (旧・鈴村研究室) : 12月 2010

2010年12月24日金曜日

[StreamGraph] Mining Large Time-evolving Data Using Matrix and Tensor Tools, SIGMOD 20007 tutorial

Mining Large Time-evolving Data Using Matrix and Tensor Tools, SIGMOD 20007 tutorial
http://www.cs.cmu.edu/~christos/TALKS/SIGMOD-07-tutorial/

How can we find patterns in sensor streams (eg., a sequence of temperatures, water-pollutant measurements, or machine room measurements)? How can we mine Internet traffic graph over time? Further, how can we make the process incremental? We review the state of the art in four related fields: (a) numerical analysis and linear algebra (b) multi-linear/tensor analysis (c) graph mining and (d) stream mining. We will present both theoretical results and algorithms as well as case studies on several real applications. Our emphasis is on the intuition behind each method, and on guidelines for the practitioner.

[StreamGraph] 既存のインクリメンタルグラフマイニング

1. An incremental frequent structure mining framework for real-time alert correlation, 2009
2. Incremental Pattern Discovery on Streams, Graphs and Tensors, 2007

2010年12月23日木曜日

[StreamGraph] インクリメンタルな Proximity 計算

Fast Incremental Proximity Search in Large Graphs, ICML 2008
http://portal.acm.org/citation.cfm?id=1390269

In this paper we investigate two aspects of ranking problems on large graphs. First, we augment the deterministic pruning algorithm in Sarkar and Moore (2007) with sampling techniques to compute approximately correct rankings with high probability under random walk based proximity measures at query time. Second, we prove some surprising locality properties of these proximity measures by examining the short term behavior of random walks. The proposed algorithm can answer queries on the fly without caching any information about the entire graph. We present empirical results on a 600, 000 node author-word-citation graph from the Citeseer domain on a single CPU machine where the average query processing time is around 4 seconds. We present quantifiable link prediction tasks. On most of them our techniques outperform Personalized Pagerank, a well-known diffusion based proximity measure.

[StreamGraph] Incremental PageRank 関連論文

Incremental PageRank for Twitter Data Using Hadoop
http://homepages.inf.ed.ac.uk/miles/msc-projects/abdullah.pdf

Fast Incremental and Personalized PageRank
http://www.citeulike.org/user/seeding-films/article/7795698

2010年12月19日日曜日

SYSTOR 2011

https://www.research.ibm.com/haifa/conferences/systor2011/

ACM KDD 2011

http://www.kdd.org/kdd2011/

2011/8 SanDiego, US

[StreamGraph] RWR の並列化・高速化

　ノード間の関連度を計算する手法　Random Walk With Restarts は　PEGASUS の１アプリケーションとしても実装されており、ソースコードにも入っています。

　また、RWRのGPUによる高速化が以下の論文で発表されています。

「Fast Mining Algorithms of Graph data on GPUs」,KDD LDMTAS2010 (KDD2010のワークショップ）, Xintian Yang (Ohio State University) (PDF)

2010年12月16日木曜日

[StreamGraph] 本プロジェクトの方向性

　本プロジェクトの方向性だが、時間的に動的変化するネットワーク（グラフ）に対するリアルタイム処理を実現する処理基盤、として体系化する方向性に行くべきだということは、ネットワーク・グラフ屋の論文を調べていくうちに確信が深くなっている。
あえて、上記で、「リアルタイム性」という表現を使って、データストリーム処理というキーワードを使わなかったかだが、それはGIM-Vの計算モデルを、GPUのようなアクセラレータを用いて既存のバッチ処理モデルを超並列処理してリアルタイム性を実現する手法、なども考えられるからだ。また、データストリーム処理というコンテキストで言えば、（１）インクリメンタルな差分更新による処理の効率化を実現する手法、（２）ヒューリスティックス（例：雁瀬くんの二次の隔たりの特性を生かしたグラフ分割と処理の分散）を用いた性能最適化、（３）メモリに載りきらないようなグラフストリームが到着した際に Persistent な分散ファイルシステムと協調し階層的にグラフ構造を格納、頻繁にアクセスされる部分グラフのみをオンメモリ上に確保するといった最適化手法、そしてそれらのグラフデータに対して開発者からシームレスにアクセスできるような処理基盤、と言った研究テーマが考えられるでしょう。
上記の研究テーマを遂行するにあたって、まずはとっかかりとして、バッチ処理の統一的な処理モデルとして提案されているGIM-Vおよびそのリファレンス実装であるPEGASUSを軸に、上記テーマの実現手法を考えていきましょう。例えば、このブログの前の記事に Incremental PageRank の論文を載せましたが、それをPEGASUS上（もしくは System S 上で）で実装してみて、GIM-V,プログラミングモデル、処理系の gap を分析し、上記（２）や（３）（（１）もできれば）を自然に表現できるモデルを提案できると良い研究成果につながると思います。

[StreamGraph] On the Bursty Evolution of Blogspace

On the Bursty Evolution of Blogspace, WWW 2005

http://www.springerlink.com/content/j15331281h531wmv/

[StreamGraph] 時系列動的グラフの視覚化

Interactive Visualization of Dynamic Social Network (PDF)

2010年12月15日水曜日

[StreamGraph] 時間的に変化する動的グラフに関する研究

Temporal Evolution of Communities in the Enron Email Data Set

Mining Temporally Evolving Graphs, 2004

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.103.6282

>Web mining has been explored to a vast degree and different techniques have been proposed for a variety of applications that include Web Search, Web Classification, Web Personalization etc. Most research on Web mining has been from a ‘data-centric ’ point of view. The focus has been primarily on developing measures and applications based on data collected from content, structure and usage of Web till a particular time instance. In this project we examine another dimension of Web Mining, namely temporal dimension. Web data has been evolving over time, reflecting the ongoing trends. These changes in data in the temporal dimension reveal new kind of information. This information has not captured the attention of the Web mining research community to a large extent. In this paper, we highlight the significance of studying the evolving nature of the Web graphs. We have classified the approach to such problems at three levels of analysis: single node, sub-graphs and whole graphs. We provide a framework to approach problems of this kind and have identified interesting problems at each level. Our experiments verify the significance of such analysis and also point to future directions in this area. The approach we take is generic and can be applied to other domains, where data can be modeled as graph, such as network intrusion detection or social networks.

評価の重要さ

　論文としての完成度を高めるには、徹底的に、提案する手法を定量的かつ定性的に評価する必要があります。良い研究であっても、その辺の Evidence がないと論文としては難しく、逆に、国際学会のいろいろな論文を見ると、提案する手法の新規性や進歩性が低くても、その辺の徹底的な分析があることによって、採択されている論文もあることに気づくでしょう。実装に比べると、評価は地道な作業ですが、そこが論文としては非常に重要であることを覚えておいてほしいと思います。

[StreamGraph] インクリメンタルなネットワーク分析手法

Incremental Page Rank Computation on Evolving Graphs (PDF), WWW 2005 ,

In this paper, we propose a method to incrementally compute PageRank for a large graph that is evolving. Our approach is quite general, and can be used to incrementally compute (on evolving graphs) any metric that satisfies the first order Markov property. Categories and Subject Descriptors

2010年12月14日火曜日

[Data Distribution] 投稿予定

データ分散最適化に関する研究の投稿先

HPDC 2011

Technical Papers Due:	17 January 2011
PAPER DEADLINE EXTENDED:	24 January 2011 at 12:01 PM (NOON) Eastern Time
Author Notifications:	28 February 2011

http://www.hpdc.org/2011/cfp.php

SYSTOR 2011

Paper Submission:	January 31, 2011
Paper Submission to highlights track:	March 4, 2011
Paper Notification:	March 21, 2011

https://www.research.ibm.com/haifa/conferences/systor2011/

[StreamCloud] 学会投稿予定

以下、StreamCloud の学会投稿予定

IEEE Cloud 2011 (http://www.thecloudcomputing.org/2011/cfp.html)

- Submission: January 20 (abstract) and 31, 2011

- Notification : March 15, 2011

IEEE CloudCom 2011 (3rd IEEE International Conference on Cloud Computing Technology and Science)

- Submission : August 21, 2011,

- Notification: September 26, 2011

[StreamGraph] Charactering Betweenness Centrality Algorithm on Multi-core Algorithms

Charactering Betweenness Centrality Algorithm on Multi-core Algorithms (PDF) , ISPA 2009

Evaluating Centrality Metrics in Real-World Networks on GPU (PDF)

[StreamGraph] Graph 500 Update

Graph500 のリファレンスコードのバージョン 1.2 が http://www.graph500.org/Reference.html に公開されている。実装は、sequential, MPI, OpenMP版（あと Cray XMT 用のコードもある）である。

入力データは、generator があり、クロネッカーグラフ (Jure の Kronecker Graph Tutorial, '08) を生成.処理はBFS( 幅優先探索)を実行。

BFS の実装自体はカスタマイズ可能。

2010年12月10日金曜日

[StreamGraph] SNAP (Stanford Network Analysis Package)

C++ベースのネットワーク分析ライブラリ: http://snap.stanford.edu/

スタンフォード大 Jure LesKovec 氏(コーネル大 Jon Kleinbergや PEGASUS の Christos Faloutsos 氏との元共同研究者）が管理.

Jure 氏の Ph.D Defense Talk : Dynamics of Large Network http://videolectures.net/sep08_leskovec_tdef/

[StreamGraph] Jon Kleinberg's Survey

http://www.cs.cornell.edu/home/kleinber/

時系列変化する動的グラフに関する参考文献 (http://www.cs.cornell.edu/Courses/cs6850/2008fa/より)

The Time Axis

Information networks are highly dynamic, but it is often hard to form a good picture of how they are evolving along their ``time axis.'' Temporal change spans many orders of magnitude, from the second-by-second dynamics of usage data to the year-by-year shifts in the global structure.

Survey Paper

J. Kleinberg. Temporal Dynamics of On-Line Information Streams. Draft chapter for the forthcoming book Data Stream Management: Processing High-Speed Data Streams (M. Garofalakis, J. Gehrke, R. Rastogi, eds.), Springer.

Short Time Scales: Usage Data and Bursty Dynamics

L.R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proc. IEEE, Vol. 77, No. 2, pp. 257-286, Feb. 1989

J. Allan, J.G. Carbonell, G. Doddington, J. Yamron, Y. Yang, Topic Detection and Tracking Pilot Study: Final Report. Proc. DARPA Broadcast News Transcription and Understanding Workshop, Feb. 1998.

R. Swan, J. Allan, Automatic generation of overview timelines. Proc. SIGIR Intl. Conf. on Research and Development in Information Retrieval, 2000.

S. Havre, B. Hetzler, L. Nowell, ThemeRiver: Visualizing Theme Changes over Time. Proc. IEEE Symposium on Information Visualization, 2000.

J. Kleinberg. Bursty and Hierarchical Structure in Streams. Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2002.

J. Aizen, D. Huttenlocher, J. Kleinberg, A. Novak. Traffic-Based Feedback on the Web. Proceedings of the National Academy of Sciences 101(Suppl.1):5254-5260, 2004.

R. Kumar, J. Novak, P. Raghavan, A. Tomkins. On the bursty evolution of Blogspace. Proc. International WWW Conference, 2003.

Y. Zhu and D. Shasha. Efficient Elastic Burst Detection in Data Streams. Proc. ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2003.

E. Gabrilovich, S. Dumais, E. Horvitz. NewsJunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty. Proceedings of the Thirteenth International World Wide Web Conference. May 2004.

M. Vlachos, C. Meek, Z. Vagena, D. Gunopulos. Identifying Similarities, Periodicities and Bursts for Online Search Queries. Proc. ACM SIGMOD International Conference on Management of Data, 2004.

A..-L. Barabasi. The origin of bursts and heavy tails in human dynamics. Nature 435, 207-211 (2005).

Micah Dubinko, Ravi Kumar, Joseph Magnani, Jasmine Novak, Prabhakar Raghavan, Andrew Tomkins. Visualizing Tags over Time. WWW2006 Conference. See also the demo of flickr tag visualization.

Xuerui Wang and Andrew McCallum. Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. Conference on Knowledge Discovery and Data Mining (KDD) 2006.

Xuanhui Wang, ChengXiang Zhai, Xiao Hu, and Richard Sproat, Mining Correlated Bursty Topic Patterns from Coordinated Text Streams. Proceedings of the 2007 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'07 ), pages 784-793.

Temporal Analysis and Bursty Phenomena

J. Leskovec, L. Backstrom, J. Kleinberg. Meme-tracking and the dynamics of the news cycle. Proc. 15th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2009.
- See also the commpanying MemeTracker site, which uses the ideas in this paper to visualize the news cycle.
L. Backstrom, J. Kleinberg, R. Kumar. Optimizing Web traffic via the Media Scheduling Problem. Proc. 15th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2009.
J. Kleinberg. Temporal Dynamics of On-Line Information Streams. In Data Stream Management: Processing High-Speed Data Streams, (M. Garofalakis, J. Gehrke, R. Rastogi, eds.), Springer, 2004.
J. Aizen, D. Huttenlocher, J. Kleinberg, A. Novak. Traffic-Based Feedback on the Web. Proceedings of the National Academy of Sciences 101(Suppl.1):5254-5260, 2004. (Also available in pre-print form.)
J. Kleinberg. Bursty and Hierarchical Structure in Streams. Proc. 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2002.
- See also some sample results from the burst detection algorithm described in this paper.

J. Leskovec, J. Kleinberg, C. Faloutsos. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. Proc. 11th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, 2005.

[StreamGraph] Enron の E-mail データ

http://www.cs.cmu.edu/~enron/

[StreamGraph] スタンフォード大の公開データ

http://snap.stanford.edu/data/

公開しているデータは以下の通り。

Social networks: online social networks, edges represent interactions between people

Communication networks: email communication networks with edges representing communication

Citation networks: nodes represent papers, edges represent citations

Collaboration networks: nodes represent scientists, edges represent collaborations (co-authoring a paper)

Web graphs: nodes represent webpages and edges are hyperlinks

Amazon networks : nodes represent products and edges link commonly co-purchased products

Internet networks : nodes represent computers and edges communication

Road networks : nodes represent intersections and edges roads connecting the intersections

Autonomous systems : graphs of the internet

Signed networks : networks with positive and negative edges (friend/foe, trust/distrust)

Wikipedia networks and metadata : Talk, editing and voting data from Wikipedia

Twitter and Memetracker : Memetracker phrases, links and 467 million Tweets

[StreamGraph] Facebook Data

カリフォルニア大学の研究グループがクロールした Facebook のデータを公開

http://odysseas.calit2.uci.edu/wiki/doku.php?id=public:online_social_networks#datasets1

[StreamGraph] Livedoor Data

トップページがいかがわしいが、Livedoor が研究用途にデータを提供しているようだ。

http://labs.edge.jp/datasets/

[StreamGraph] データストリームに対する相関ルールを用いたコミュニティの時系列解析

データストリームに対する相関ルールを用いたコミュニティの時系列解析

https://kaigi.org/jsai/webprogram/2010/pdf/25.pdf

実験では、Livedoor が提供するソーシャルブックマークのデータを用いている

http://wiki.livedoor.jp/staff_clip/d/FrontPage

Body sensor data processing using stream computing

MIR '10 Proceedings of the international conference on Multimedia information retrieval

Advances in sensor technologies have accelerated the instrumentation of medical institutions. Today, modern intensive care units use sophisticated patient monitoring systems able to produce massive amounts of physiological streaming data. While these monitoring systems aim at improving patient care and staff productivity, they have the potential of introducing a data explosion problem. We address this problem by developing an open infrastructure upon which healthcare analytics can be built, managed, and deployed to analyze in real time physiological streaming data and turn this data into meaningful information for medical professionals. This infrastructure incorporates feature extraction and data mining functionalities for the discovery of clinical rules capable of identifying medically significant events. The system is based on a state of the art stream computing middleware. This paper presents this infrastructure from a programming model perspective. An exemplar application for arrhythmia detection is also described to illustrate its capabilities.

http://portal.acm.org/citation.cfm?id=1743465&dl=ACM&coll=DL&CFID=1325502&CFTOKEN=96177935

2010年12月9日木曜日

[StreamGraph] Some datasets available on the web

http://www.datawrangling.com/some-datasets-available-on-the-web

2010年12月8日水曜日

[StreamGraph] 食べログ

食べログ（http://u.tabelog.com/）には、１日に２００回までという制限がありますが、API があるようです。チェックしてみてください。２部グラフだと、グルメレビュアーとレストランとなるでしょう。グルメレビュアーは２０万人弱、レストランは６１万店舗となります。

[StreamGraph] Mixi Graph API

Mixi Graph API について Mixi に問い合わせたところ、直接お電話を頂いた。Graph API はパートナーとなる企業のみが使用できるのだが、どうも、それさえも、やはり、すべてのデータをクロールできるわけではないらしい。また、取得したデータは24時間以内には削除しなければならないのだそうだ。

今後、共同研究のような形で一緒にできるかどうかを話し合う機会がありそうだが、昨今の事情を考えると、日本のネット系の企業はあまり積極的ではない。WWWなどのウェブ系では著名な学会では、Facebook , Google, Flicker, Amazon などの実データを使った研究がたくさん出ているが、日本はとにかくそのあたり、消極的だ。米国を中心に、この手のネット企業が学会でプレゼンスを出すことにより、より Visibility を上げ、更には優秀な技術者が入ってくれるという良い循環となっている。

日本でも、そのように産学一体となって、この周辺の研究を加速化させ、Win-Win の関係を構築することが大変重要だと思うのだが。。。

[StreamGraph] Facebook App

Facebook の APIの手とのため、StreamGraph プロジェクト用にアプリケーションを登録。

http://developers.facebook.com/docs/authentication/#client_credentials

以下でアクセストークンを取得できる。your_app_id, your_app_secret は必要に応じて教えます。

curl -F grant_type=client_credentials \      -F client_id=your_app_id \      -F client_secret=your_app_secret \      https://graph.facebook.com/oauth/access_token

上記コマンドを発行すると、access_token が返ってくる。

[StreamGraph] Facebook Graph API

Facebook Graph API を用いれば実データは取れるそうだ。

------

The announcement was glossed over during Facebook’s many launches at f8 yesterday, but the company has introduced ways for developers to conduct a variety of searches for content in its Graph API. This could benefit any number of search companies and other developers who are looking for lots of public data to aggregate and organize.

The Open Graph allows applications, Pages, web sites and other software services to add Facebook features, like the “Like” button, to their own sites. Taking actions on other sites results in those actions being shared with your friends on Facebook, and may allow friends on those sites to see what you’re doing, too.

Many users have their privacy settings configured to allow status updates, likes and other activity to be public. Its terminology for this data is “public objects.” The generic search format is: https://graph.facebook.com/search?q=QUERY&type=OBJECT_TYPE.

Here are the specific types of searches that can be done:

All public posts: https://graph.facebook.com/search?q=watermelon&type=post

People: https://graph.facebook.com/search?q=mark&type=user

Pages: https://graph.facebook.com/search?q=platform&type=page

Events: https://graph.facebook.com/search?q=conference&type=event

Groups: https://graph.facebook.com/search?q=programming&type=group

News feeds of individuals can be searched, too, but you’ll need authorization. Here’s an example:

News Feed: https://graph.facebook.com/me/home?q=facebook

This search is restricted to public data about users and their immediate friends (not, say, their friends of friends), in keeping with Facebook’s existing data-sharing policies.

2010年12月7日火曜日

[StreamGraph] グラフアルゴリズムの GPU 化

Fast Mining Algorithms of Graph data on GPUs, LDMTA 2010

Scaling up the sparse matrix-vector multiplication kernel on modern Graphics Processing Units (GPU) has been at the heart of numerous studies in both academia and industry. In this article we present a novel approach to data representation for computing this kernel, particularly targeting sparse matrices representing power-law graphs. Using real data, we show how our representation scheme, coupled witha novel tiling algorithm, can yield signi ficant bene ts over the current state of the art GPU and CPU e orts on a number of core data mining algorithms such as PageRank, HITS and Random Walk with Restart.

http://arnetminer.org/confs/LDMTA2010/KDD-LDMTA-YangXintian.pdf

Accelerating Large Graph Algorithms on the GPU Using CUDA, 2007

http://www.springerlink.com/content/y4816x2q7475v93n/

Large graphs involving millions of vertices are common in many practical applications and are challenging to process. Practical-time implementations using high-end computers are reported but are accessible only to a few. Graphics Processing Units (GPUs) of today have high computation power and low price. They have a restrictive programming model and are tricky to use. The G80 line of Nvidia GPUs can be treated as a SIMD processor array using the CUDA programming model. We present a few fundamental algorithms – including breadth first search, single source shortest path, and all-pairs shortest path – using CUDA on large graphs. We can compute the single source shortest path on a 10 million vertex graph in 1.5 seconds using the Nvidia 8800GTX GPU costing $600. In some cases optimal sequential algorithm is not the fastest on the GPU architecture. GPUs have great potential as high-performance co-processors.

A Fast GPU Algorithm for Graph Connectivity, IPDPS 2010

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470817

raphics processing units provide a large computational power at a very low price which position them as an ubiquitous accelerator. General purpose programming on the graphics processing units (GPGPU) is best suited for regular data parallel algorithms. They are not directly amenable for algorithms which have irregular data access patterns such as list ranking, and finding the connected components of a graph, and the like. In this work, we present a GPU-optimized implementation for finding the connected components of a given graph. Our implementation tries to minimize the impact of irregularity, both at the data level and functional level. Our implementation achieves a speed up of 9 to 12 times over the best sequential CPU implementation. For instance, our implementation finds connected components of a graph of 10 million nodes and 60 million edges in about 500 milliseconds on a GPU, given a random edge list. We also draw interesting observations on why PRAM algorithms, such as the Shiloach-Vishkin algorithm may not be a good fit for the GPU and how they should be modified.

2010年12月3日金曜日

[StreamGraph] Mixi Graph API

Mixi Graph API に関しては、パートナー契約が必要だそうで、登記簿などの提出が必要だそうです。なので、Mixi に大学などの研究組織に関しては、より手続きの簡素化ができるかどうか問い合わせをしておきました。

Mixi Graph API http://developer.mixi.co.jp/connect/mixi_graph_api

インターネットコンテンツの扱いに関して

著作権法改正案が国会提出、「ダウンロード違法化」など盛り込む

http://internet.watch.impress.co.jp/cda/news/2009/03/11/22747.html

文部科学省著作権法の一部を改正する法律案

http://www.mext.go.jp/b_menu/houan/an/171/1251917.htm

文化庁のオフィシャルページ

http://www.bunka.go.jp/chosakuken/21_houkaisei.html

[StreamGraph] Netflix Challenge

ビデオレンタル会社 Netflix が依然実施したコンテスト (Netflix Prize: リコメンデーションの精度を10％以上向上させたら賞金) により、この分野における研究は大きく前進した。当然、NetFlix は次のコンテストを企画していたそうだったが、個人を一意に特定できないことを証明することは難しく、プライバシーの問題でキャンセルしたそうだ。

http://www.techdirt.com/articles/20100312/1148468542.shtml