2012年11月9日金曜日

Location Prediction from Tweets

http://infolab.cse.tamu.edu/static/papers/cikm1184c-cheng.pdf
http://www.hpl.hp.com/research/scl/papers/socialmedia/socialmedia.pdf
http://doras.dcu.ie/16754/1/nohare_paper.pdf

2012年9月17日月曜日

Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation

Spectral Analysis for Billion-Scale Graphs: Discoveries and Implementation
http://www.cs.cmu.edu/~ukang/papers/HeigenPAKDD2011.pdf

Ogata-kun's work should refer the above effort ...

2012年8月13日月曜日

Origin-Destination Data Generation

Deriving origin–destination data from a mobile phone network
http://www.esi2.us.es/GT/docs/iet_art1.pdf

DYNAMIC ORIGIN-DESTINATION MATRIX ESTIMATION AND PREDICTION FOR REAL- TIME TRAFFIC MANAGEMENT SYSTEMS
http://trid.trb.org/view.aspx?id=640422

Time-Varying Network Tomography : Router Link Data
ftp://cs.bell-labs.com/cm/stat/doc/NetTomography.pdf

Development of Revised Methodology for Collecting Origin-Destination Data
http://www.dot.state.fl.us/research-center/Completed_Proj/Summary_PL/FDOT_BD544_30_rpt.pdf


Real-Time Estimation of Origin-Destination Matrices with Partial Trajectories from Electronic Toll Collection Tag Data

http://pubsindex.trb.org/view.aspx?id=775157


Dynamic Origin–Destination Demand Estimation Using Automatic Vehicle Identification Data

http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1603556&userType=inst


Distributed Approach for Estimation of Dynamic Origin—Destination Demand
http://trb.metapress.com/content/96400l51428r4737/

2012年8月9日木曜日

Gephi

Gephi (http://gephi.org/)


Gephi is an open-source and multiplatform software distributed under the dual license CDDL 1.0 and GNU General Public License v3.
Layout algorithms give the shape to the graph. Gephi provides state-of-the-art algorithms layout algorithms, both for efficiency and quality. The Layout palette allows user to change layout settings while running, and therefore dramatically increase user feedback and experience: Force-based algorithms or Multi-level algorithms (graph coarsening). 

Video demonstration : 
https://gephi.org/videos/

Time-evolving network visualization

Hashikawa-kun will be working on the time-evolving twitter network since 2006.

The reference demo might be as follows :

Egyptian Revolution on Twitter
https://gephi.org/tag/dynamics/

http://www.youtube.com/watch?v=9guRNGiNBAg
http://www.youtube.com/watch?v=KKNxulf7RNg&feature=relmfu
http://www.youtube.com/watch?v=Hos4m3iO108&feature=related
http://www.youtube.com/watch?v=hI-7wteRv7E
http://computationallegalstudies.com/2010/01/06/the-time-evolving-structure-of-the-gawaher-islamic-forum-as-experienced-by-umar-farouk-abdulmutallab-the-christmas-day-bomber/
http://www.youtube.com/watch?v=a7uTKwbsFtg&feature=related


Not relevant but still interesting visualization:
http://www.evolutionoftheweb.com/?hl=ja#/growth/day




2012年8月2日木曜日

Research on Microblog services / Twitter

What is Twitter, a Social Network or a News Media?http://an.kaist.ac.kr/~haewoon/papers/2010-www-twitter.pdf

A geographic study of tie strength in social media
http://dl.acm.org/citation.cfm?id=2063959&CFID=100588297&CFTOKEN=89581781

Outtweeting the Twitterers - Predicting Information Cascades in Microblogs
http://static.usenix.org/event/wosn10/tech/full_papers/Galuba.pdf

Why Rumors Spread Fast in Social Networks
http://www.mpi-inf.mpg.de/~tfried/paper/2012CACM.pdf

Measuring Behavioral Trust in Social Network
http://www.cs.rpi.edu/research/pdf/10-03.pdf

Time-Based Sampling of Social Network Activity Graphs
http://www.cs.purdue.edu/homes/neville/papers/ahmed-mlg2010.pdf


Massive Social Network Analysis: Mining Twitter for Social Good

http://www.cc.gatech.edu/~bader/papers/MassiveTwitter-ICPP2010.pdf


Understanding Latent Interactions in Online Social Networks

http://cs.ucsb.edu/~Ravenben/publications/pdf/renren-imc10.pdf


Facebook

As I introduced it a bit at yesterday's seminar, the crawled data set for Facebook is downloadable from the following URL.
http://suzumura-lab.blogspot.jp/2010/12/streamgraph-facebook-data.html

2012年7月11日水曜日

Status

雁瀬君 -> 坂本君が作っている X10 ベースの GIM-V 処理系をベースに スケーラブルなIncremental GIM-V 処理系を作る
上野君 -> general model for graph processing, SSSP for 8/8
渡部君 -> Giraph on TSUBAME, 性能分析
岡田君-> 感情分類器を作成中
小形君 -> ParGraph2012 の論文に着手
Charuwat 君 -> RDF data に対するグラフ解析の検討。BCを ParGraphに出す
橋川君->Twitter REST API を用いてクローラ開発






Important finding on Twitter REST API

Important finding on Twitter REST API
https://dev.twitter.com/docs/api

2012年6月21日木曜日

Clearing the clouds: a study of emerging scale-out workloads on modern hardware


Clearing the clouds: a study of emerging scale-out workloads on modern hardware

http://dl.acm.org/citation.cfm?id=2150982

Graph related stuff

Lifting Sequential Graph Algorithms for Distributed-Memory Parallel Computationhttp://osl.iu.edu/publications/prints/2005/Gregor:OOPSLA:2005.pdf



Efficient Parallel Graph Exploration on Multi-Core CPU and GPU, PACT2011

http://ppl.stanford.edu/papers/pact11-hong.pdf



Accelerating CUDA Graph Algorithms at Maximum Warp, PPOP2007

http://ppl.stanford.edu/papers/ppopp070a-slides.pdf



Efficient Breadth-First Search on the Cell/B.E. Processor
http://www.dais.unive.it/~calpar/AA07-08/bfs.pdf



Optimizing Parallel Sparse Matrix-Vector Multiplication by Corner Partitioning, PARA08
http://www.sandia.gov/~egboman/papers/PARA08.pdf

Fast sparse matrix-vector multiplication by partitioning and reordering
http://igitur-archive.library.uu.nl/dissertations/2011-0913-201603/UUindex.html

More introduction stuff
http://www.cs.berkeley.edu/~skamil/cs267/notes/lect15NoteSpMV_kim.pdf

CACHE-OBLIVIOUS SPARSE MATRIX–VECTOR MULTIPLICATION BY
USING SPARSE MATRIX PARTITIONING METHODS

http://people.cs.kuleuven.be/~albert-jan.yzelman/PDFs/yzelman09-rev.pdf



Parallel Hypergraph Partitioning for Scientific Computing

http://www.cs.sandia.gov/~egboman/papers/IPDPS06.pdf



ON TWO-DIMENSIONAL SPARSE MATRIX PARTITIONING:
MODELS, METHODS, AND A RECIPE∗, 2010

http://graal.ens-lyon.fr/~bucar/papers/ucca2D.pdf


Application to protein-protein interaction network. 

A. Vazquez, A. Flammini, A. Maritan, and A. Vespignani.
Global protein function prediction in protein-protein interaction network
http://www.ncbi.nlm.nih.gov/pubmed/12740586


Application to computational phylogenetics 

B. Moret, D. Bader, and T. Warnow. High-performance algorithm engineering for computational phylogenetics. In
Proc. Int’l Conf. on Computational Science, volume 2073–
2074 of Lecture Notes in Computer Science, San Francisco,
CA, 2001. Springer-Verlag.


B. M. Moret, D. Bader, T. Warnow, S. Wyman, and M. Yan.
GRAPPA: a high-performance computational tool for phylogeny reconstruction from gene-order data. In Proc.
Botany, Albuquerque, NM, Aug. 2001.

2012年6月19日火曜日

Machine-Learning with Real-time and Streaming Applications

http://lyra.berkeley.edu/CDIConf/program.html

2012年6月12日火曜日

メモリ消費量を考慮したジョブスケジューリング

2年ほど前に書いた以下のエントリの問題は現実に必要な技術。
http://suzumura-lab.blogspot.jp/2010/07/blog-post_28.html

Apache Mahout

Machine Learning の Hadoop ベースのライブラリ Apache Mahout http://mahout.apache.org/
サポートされているアルゴリズム。
  • Collaborative Filtering
  • User and Item based recommenders
  • K-Means, Fuzzy K-Means clustering
  • Mean Shift clustering
  • Dirichlet process clustering
  • Latent Dirichlet Allocation
  • Singular value decomposition
  • Parallel Frequent Pattern mining
  • Complementary Naive Bayes classifier
  • Random forest decision tree based classifier
  • High performance java collections (previously colt collections)
  • A vibrant community

各自のタスク

各自のタスクの概要は以下の通りです。

Miyuru君: X10 Workshop @ Beijing 発表。SOCC2012投稿。ScaleGraphプロジェクトリード
上野君:HPDC発表準備/発表, ISC参加。ACS論文誌発表. 青柳君 X10 ベース ストリーム処理系ヘルプ、大規模グラフ処理基盤アーキテクチャ設計.
雁瀬君:WISE論文執筆。I-GIMV の追求. Apache Giraph 上への実装
渡部君:次の研究テーマ決め。Apache Giraph, HAMA, HBase などの実装。関連論文読み。I-GIMV論文読みなど
岡田君:イベント検知+感情・評判分析を用いたストリーム処理基盤のアーキテクチャおよびコンポーネントの実装詳細設計
Charuwat君:BC最適化実装。RDF + DBPedia、グラフパターンマッチング調査
橋川君:ソーシャルネットワークのシミュレーション基盤構築。6月は Twitter のクローラー実装。
小形君:Spectral Clustering 。MAGMA による高速化。他のクラスタリングアルゴリズムの調査
バオ君:Apache Giraph の実装中身、サンプルアプリケーション、TSUBAME 上での性能評価など
金刺君:大規模エージェントシミュレーションのメッセージング機能の実装。データ同化 (Data Assimilation) が研究テーマ。


2012年6月9日土曜日

Pregel のオープンソース実装

バオ君の次の課題として、Pregel のオープンソース実装である Apache Giraph (http://incubator.apache.org/giraph/)を試してもらっている。Web ページを見ると、Fault Tolerancy などを鑑みて Hadoop/Zookeeper 上に実装されているようだ。性能面では ?? な実装である。

ちなみに、前にブログでも紹介したことがあるが、BSP(Bulk Synchronous Parallel)を実装している Apache Hama (http://incubator.apache.org/hama/)というプロジェクトもあるが、それとは関係ない。こちらはHDFS は使っているようだ。

また、Pregel のもう一つの実装である Bagel というのもある。以下の記事が参考になる。
http://d.hatena.ne.jp/smly/20110730/1312022963

2012年5月30日水曜日

Visualization for Large-scale Traffic Simulation

http://www.trsp.net/cow/video_trafficmixer.html

Visualizing the Kyoto road network by Prof.Hattori
http://www.youtube.com/watch?v=qYJLVZvvKR4&feature=relmfu
http://www.youtube.com/watch?v=I2vvIA9sKRo

Google Earth + KML による視覚化。いまいち。
http://www.youtube.com/watch?v=FEMOpMtenvE
http://www.youtube.com/watch?NR=1&feature=endscreen&v=bkR2ZcLHufI

Visualization with Google Earth
http://www.gtrek.co.uk/GTrek-Flight.kmz

2012年5月14日月曜日

IPDPS 2012 におけるグラフ処理に関する論文

Regular Track に8本もグラフに関する論文が採択されているところを見ると、大規模グラフに関する研究が近年ますます盛んになっている証拠である。http://www.ipdps.org/ipdps2012/2012_advance_program.html


Fast and Efficient Graph Traversal Algorithm for CPUs : Maximizing Single-Node Efficiency
Jatin Chhugani (Intel Corporation, USA); Nadathur Satish (Intel Corporation, USA); Changkyu Kim (Intel Corporation, USA); Jason Sewall (Intel Corporation, USA); Pradeep Dubey (Intel Corporation, USA)
SAHAD: Subgraph Analysis in Massive Networks Using Hadoop
Zhao Zhao (Virginia Tech, USA); Guanying Wang (Virginia Tech, USA); Ali R Butt (Virginia Tech., USA); Maleq Khan (Virginia Tech, USA); Vullikanti S Anil Kumar (Virginia Tech, USA); Madhav Marathe (Virginia Tech, USA)
Accelerating nearest neighbor search on manycore systems
Lawrence Cayton (Max Planck Institute for Intelligent Systems, Germany)
Optimizing large-scale graph analysis on multithreaded, multicore platforms
Guojing Cong (IBM T.J. Watson Research Center, USA)

Multi-core spanning tree algorithms using the disjoint-set data structure
Fredrik Manne (University of Bergen, Norway); Md. Mostofa Ali Patwary (Northwestern University, USA); Peder Refsnes (University of Bergen, Norway)
Graph Partitioning for Reconfigurable Topology
Deepak Ajwani (University College Cork, Ireland); Shoukat Ali (Dublin Research Lab, IBM, USA); John P. Morrison (University Cork, Ireland)
Multithreaded Clustering for Multi-level Hypergraph Partitioning
Umit V. Catalyurek (The Ohio State University, USA); Mehmet Deveci (The Ohio State University, USA); Kamer Kaya (The Ohio State University, USA); Bora Ucar (CNRS, France)              
Multithreaded Algorithms for Maximum Matching in Bipartite Graphs
Ariful Azad (Purdue University, USA); Mahantesh Halappanavar (Pacific Northwest National Laboratory, USA); Sivasankaran Rajamanickam (Sandia National Laboratories, USA); Erik G. Boman (Sandia National Laboratories, USA); Arif Khan (Purdue University, USA); Alex Pothen (Purdue University, USA)

2012年5月10日木曜日

情報拡散に関する研究

Information Transfer in Social Media, WWW 2012
Greg Ver Steeg, Aram Galstyan

The Role of Social Networks in Information Diffusion, WWW 2012
Eytan Bakshy, Itamar Rosenn, Cameron Marlow, Lada Adamic

Recommendations to Boost Content Spread in Social Networks, WWW 2012
Vineet Chaoji, Sayan Ranu, Rajeev Rastogi, Rushi Bhatt


Differences in the Mechanics of Information Diffusion Across Topics: Idioms, Political Hashtags, and Complex Contagion on Twitter, WWW 2011
http://www.cs.cornell.edu/home/kleinber/www11-hashtags.pdf

Limiting the Spread of Misinformation in Social Networks, WWW 2011 
Ceren Budak, Divyakant Agrawal and Amr El Abbadi

Information Credibility on Twitter, WWW, 2011 
Carlos Castillo, Marcelo Mendoza and Bárbara Poblete

Information Spreading in Contex, WWW 2011

Information diffusion in online social networks

Survey Survey on Information Diffusion, 2008

Modeling Information Diffusion in Implicit Networks

Information Diffusion Through Blogspace, WWW2004
http://people.csail.mit.edu/dln/papers/blogs/idib.pdf
We study the dynamics of information propagation in environments of low-overhead personal publishing, using a large collection of weblogs over time as our example domain. We characterize and model this collection at two levels. First, we present a macroscopic characterization of topic propagation through our corpus, formalizing the notion of long-running “chatter” topics consisting recursively of “spike” topics generated by outside world events, or more rarely, by resonances within the community. Second, we present a microscopic characterization of propagation from individual to individual, drawing on the theory of infectious diseases to model
the flow. We propose, validate, and employ an algorithm to induce the underlying propagation network from a sequence of posts, and report on the results.

2012年4月12日木曜日

CREST H23 年度 RA 成果報告会

CREST 平成23年度の RA の成果報告会を実施。パターンマッチング、スペクトラルクラスタリング、Random Walk with Restart、協調フィルタリング、ストリーム処理系の5つを5名の RA が実装。いずれも短期間ながら良い成果が出てきたと思います。今年も3名が引き続き、RA を続けていきます。小規模クラスタだけではなく、今年は TSUBAME を使った大規模実験およびそれに向けた最適化を実現したいと思います。

2012年4月3日火曜日

Performance Optimization for X10-based Large-scale Traffic Simulation

We are intensively working on the performance optimization for large-scale traffic simulation based on X10. What we are firstly facing is the synchronization overhead. If computational loads are not sufficient for given compute resources, the performance scalability can not be obtained due the the simulation model in that every synchronization is occurred at each step. The "step" here corresponds to "second". If the target road network is not that large, that synchronization can not be ignored. However, if we need to target larger network, such synchronization needs to be treated more carefully since all the cross points should be synchronized at every "steps". We introduce more relaxed simulation model in that the synchronization is occurred with longer interval. Determining the appropriate interval time is greatly important and the tradeoff between the interval and simulation precision exist. Thus we are currently working what's the most appropriate interval by given two facts, underlying execution environment and the precision threshold level. If we could find such a good balance, we will definitely be able to achieve more performance for given compute resources. The simulation itself had not been out of our research scope, but a lot of graph techniques could be leveraged for the performance optimization. Anyway, let's go on and publish this exciting research soon to international conferences.

2012年3月31日土曜日

送別会

松浦君、西井君、森田君が卒業。グエン君は大学院より亀井研究室に移動します。皆さん、これからも頑張ってください。


2012年3月17日土曜日

Great achievement

Three years have passed since we established our lab back in 2009. Just for only 3 years, we have published 36 papers, which is tremendous achievement !! We should accelerate this effort to international conferences more from this April.

Network Visualization Demo with Processing

Network Visualization Demonstration by Max-Plank Institute with the Processing programming language'. The visualization is out of our research scope, but it is strongly important to have this kind of demonstration to entertain the general people and show our innovative technologies.

URL: http://max-planck-research-networks.net/

2012年2月12日日曜日

Updated list of top-quality international conferences

WISE: International Conference on Web Information Systems Engineering, http://www.wise2012.cs.ucy.ac.cy/, 5/18/2012 Due

WSDM: ACM International Conference on Web Search and Data Mining, Paper Due --> August

IV: IEEE Intelligent Vehicles Symposium, Paper Due --> February,

CSE / PP: SIAM Conference on Computational Science and Engineering / SIAM Conference on Parallel Processing for Scientific Computing (held in altnernating years)

Euro-Par: Parallel and Distributed Computing, Paper Due --> February,

HiPC: IEEE International Conference on High Performance Computing, Paper Due -> 5/16/2012
http://www.hipc.org/hipc2012/

HPDC: International ACM Symposium on High Performance Parallel and Distributed Computing

ICS: International Conference on Supercomputing --> January,

SPAA: ACM Symposium on Parallelism in Algorithms and Architectures

AAAI : AAAI Conference on Artificial Intelligence (AI)
ACL : Annual Meeting of the Association for Computational Linguistics (Natural Language Processing)
ACM CCS : ACM Conference on Computer and Communications Security (Security & Privacy)
ACM CHI : Computer-Human Interaction (HCI)
ACM SIGCOMM : ACM SIGCOMM Special Interest Group on Data Communications (Communications & Networking)
ACM SIGGRAPH : Special Interest Group on Graphics and Interactive Techniques (Graphics & Visualization)
ACM SIGIR : Special Interest Group on Information Retrieval (Web)
ACM SIGMETRICS : International Conference on Measurement and Modeling of Computer Systems (Performance Modeling & Analysis)
ACSAC : Annual Computer Security Applications Conference (Security & Privacy)
AMIA : American Medical Informatics Association Spring Congress & Annual Symposium (Medical Informatics)

ASPLOS : International Conference on Architectural Support for Programming Languages and Operating Systems (Computer Architecture, Programming Languages & Software Engineering)
BPM : Business Process Management Conference (Service Science)
IKM : ACM International Conference on Information and Knowledge Management (Web)
COLING : International Conference on Computational Linguistics (Natural Language Processing)
CSCW : Computer Supported Cooperative Work (HCI)


DSN : IEEE/IFIP International Conference on Dependable systems & Networks (Distributed & Fault Tolerant Computing)
ECOOP : European Conference on Object-Oriented Programming (Programming Languages & Software Engineering)
EDBT : International Conference on Extending Database Technology (Data Management)



HPCA : International Symposium on High Performance Computer Architecture (Computer Architecture)

ICDCS : IEEE International Conference on Distributed Computing Systems (Distributed & Fault Tolerant Computing) --> 2012, November, Paper Due


ICDE : International Conference on Data Engineering (Data Management)
ICDM : IEEE International Conference on Data Mining (Knowledge Discovery & Data Mining)
ICDT : International Conference on Database Theory (Data Management)

ICML : International Conference on Machine Learning (AI, Knowledge Discovery & Data Mining)
ICPR : International Conference on Pattern Recognition (User Interface Technologies)
ICSE : ACM/IEEE International Conference on Software Engineering (Programming Languages & Software Engineering)

ICSOC : International Conference on Service Oriented Computing (Services Computing)-->4/29/2012

ICWS : IEEE International Conference on Web Services (Services Computing)

IEEE ICNP : IEEE International Conference on Network Protocols (Communications & Networking)

IEEE INFOCOM : Conference on Computer Communications (Communications & Networking, Performance Modeling & Analysis)

IEEE INFOVIS : Symposium on Information Visualization (Graphics & Visualization)
IEEE SYMPOSIUM ON SECURITY AND PRIVACY (Security & Privacy)
IEEE VISUALIZATION (Graphics & Visualization)

IFIP PERFORMANCE : International Symposium on Computer Performance, Modeling, Measurements, and Evaluation (Performance Modeling & Analysis)

IMC : Internet Measurement Conference (Communications & Networking)
http://www-net.cs.umass.edu/imc2012/cfp.htm -> 2012/05/11

IPDPS : International Parallel and Distributed Processing Symposium (Supercomputing) -> 2012/10/1 (Paper Due) 



MIDDLEWARE (Distributed & Fault Tolerant Computing)
http://middleware2012.cs.mcgill.ca/doku.php?id=start
Paper Due: 2012/05/25


KDD : ACM International Conference on Knowledge Discovery & Data Mining (Knowledge Discovery & Data Mining)



SODA : ACM-SIAM Symposium on Discrete Algorithms (Algorithms & Theory)
SOSP : ACM Symposium on Operating Systems Principles (Operating Systems)


FSE : ACM SIGSOFT Symposium on Foundations of Software Engineering (Programming Languages & Software Engineering)
INTERACT : Interact (HCI)
IPCO : Conference on Integer Programming and Combinatorial Optimization (Operations Research)
IUI : International Conference on Intelligent User Interfaces (AI, HCI)
IJCAI : International Joint Conference on AI (AI)
MEDINFO : World Congress on Medical and Health Informatics (Medical Informatics)
MICCAI : Medical Image Computing and Computer-Assisted Intervention (Medical Informatics)


UAI : Conference on Uncertainty in Artifical Intelligence (AI)
UBICOMP : International Conference on Ubiquitous Computing (Mobile Computing)
UIST : ACM Symposium on User Interface Software and Technology (HCI)

MIE : International Congress of the European Federation for Medical Informatics (Medical Informatics)
MOBICOM : International Conference on Mobile Computing and Networking (Communications & Networking, Mobile Computing)
MOBISYS : International Conference on Mobile Systems, Applications, and Services (Communications & Networking, Mobile Computing)
NDSS : Network and Distributed System Security Symposium (Security & Privacy)
NIPS : Conference on Neural Information Processing Systems (AI)
OSDI : USENIX Symposium on Operating Systems Design and Implementation (Operating Systems)
PACT : International Conference on Parallel Architectures and Compiler Techniques (Computer Architecture)
PLDI : ACM Conference on Programming Language Design and Implementation (Programming Languages & Software Engineering)
SDM : SIAM conference on Data Mining (Knowledge Discovery & Data Mining)
USENIX FAST : Conference on File and Storage Technologies (Storage Systems)

USENIX NSDI : USENIX Symposium on Networked Systems Design and Implementation (Communications & Networking)
Medical Informatics
IEEE EMBC: International Conference of the IEEE Engineering in Medicine and Biology Society
IEEE ISBI: International Symposium on Biomedical Imaging
IHI: ACM International Health Informatics Symposium
SIIM: Symposium on Imaging Informatics in Medicine
SPIE Medical Imaging: Society of Photo-Optical Instrumentation Engineers
Graphics & Visualization
EuroVis: Eurographics/IEEE Symposium on Visualization
IEEE PacificVis: Pacific Visualization Symnposium
SoftVis: Software Visualization
VAST: Visual Analytics Science and Technology
VR: IEEE Virtual Reality

SIGMOD/PODS : International Conference on Management of Data/Principles of Database Systems (Data Management)

SPLASH: Systems, Programming, Languages, and Applications: Software for Humanity (Programming Languages & Software Engineering)


SUPERCOMPUTING : International Conference for High Performanance Computing, Networking, Storage, and Analysis (Super Computing)Paper Due: 2012/04/20

USENIX : USENIX Annual Technical Conference (Operating Systems)


VLDB : Very Large Databases Conference (Data Management)
WWW : International World Wide Web Conference (Web)

-----
SYSTOR: Israeli Experimental Systems Conference

CLOUD: IEEE International Conference on Cloud Computing

CGO: Symposium on Code Generation and Optimization

VEE: Virtual Execution Environments

PADTAD: Parallel and Distributed Systems: Testing, Analysis, and Debugging
Due: April
QEST: International Conference on Quantitative Evaluation of Systems
http://www.qest.org/qest2012/
Paper Due: 2012/03/12

IISWC: IEEE International Symposium on Workload Characterization
http://www.iiswc.org/iiswc2012/index.html
Paper Due: 2012/05/25

ISPASS: IEEE International Symposium on Performance Analysis of Systems and Software
http://ispass.org/ispass2012/
Paper Due: October

MASCOTS: IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication
http://mascots.cs.vt.edu/mascots2012/?page_id=2
Paper Due: 2012/04/08


Distributed & Fault-Tolerant Computing
CNSM: Conference on Network and Services Management

DEBS: International Conference on Distributed Event-Based Systemshttp://www.csw.inf.fu-berlin.de/debs2012/
Paper Due: March

DISC: International Symposium on Distributed Computing
http://disc2011.dis.uniroma1.it/dates.php?lang=eng
2012/04/29

HotCloud: USENIX Workshop on Hot Topics in Cloud Computing
ICAC: International Conference on Autonomic Computing


LADIS: ACM SIGOPS & SIGACT Workshop on Large Area Distributed Systems & Middleware
SOCC: ACM Symposium on Cloud Computing
SRDS: IEEE International Symposium on Reliable Distributed Systems

Data Management
ACMGIS: ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems

WAIM: Web-Age Information Management
http://db.hit.edu.cn/waim2012/
Due: March

WebDB: International Workshop on Web and Databases

ACM Sensys: ACM Conference on Embedded Networked Sensor Systems

AAMAS: International Conference on Autonomous Agents --> October,

ECML PKDD: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases

2012年2月11日土曜日

2012年1月31日火曜日

X10 関係

- インクリメンタルビルドのツールまたはシェルスクリプト
- ソースコードデバッグ
- Optimization のオプションをつけたときにバグ

2012年1月21日土曜日

Graph research activities in Japan

京都大学 吉川 正俊 先生のグループ
http://kyouindb.iimc.kyoto-u.ac.jp/e/wA3vS

Yasushi Sakurai, Lei Li, Yasuko Matsubara, Christos Faloutsos, ”WindMine: Fast and Effective Mining of Web-click Sequences”, 11th SIAM International Conference on Data Mining (SDM), Mesa, Arizona, April 28-30, 2011.
A Bipartite Graph Model and Mutually Reinforcing Analysis for Review Sites
Towards Improving Wikipedia as an Image-rich Encyclopedia through Analyzing Appropriateness of Images for an Article
Enishi: Searching Knowledge about Relations by Complementarily Utilizing Wikipedia and the Web
Mining and Explaining Relationships in Wikipedia

Graph Database

S. Sakr (2009) “Storing and Querying Graph Data Using Efficient Relational Processing Techniques”. Proceedings of the 8th International Conference on Information Systems Technology and its Applications (ISTA/UNISCON 2009), Sydney, Australia

S. Sakr (2009) “GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries”. Proceedings of the 14th International Conference on Database Systems for Advanced Applications (DASFAA 2009), Brisbane, Australia

S. Sakr, G. Al-Naymat (2010) “Graph Indexing and Querying: A Review”. International Journal of Web Information Systems (IJWIS) 6(2): 101-120 [PDF] [Citations].

S. Sakr, G. Al-Naymat (2010) “An Efficient Features-Based Processing Technique for Supergraph Queries”. Proceedings of the 14th International Database Engineering and Applications Symposium (IDEAS 2010), Montreal, QC, CANADA, [PDF].

S. Sakr, G. Al-Naymat (2010) “Efficient Relational Techniques for Processing Graph Queries”. Journal of Computer Science and Technology (JCST) 25(6): 1237-1255, [PDF].

Taihei Oshino, Yasuhito Asano, and Masatoshi Yoshikawa
Mining Useful Time Graph Patterns on Extensively Discussed Topics on the We
http://dl.acm.org/citation.cfm?id=1880857

Graph Data Management: Techniques and Applications [Hardcover]
Sherif Sakr (Author, Editor), Eric Pardede (Editor)

Accepted papers in GDM2012

A Comparison of Current Graph Database Models
Benchmarking Traversal Operations Over Graph Databases
Mining Associations Using Directed Hypergraphs
Design of Declarative Graph Query Languages: On the Choice between Value, Pattern and Object based Representations for Graphs
Finding Skyline Nodes in Large Networks (Invited Paper)
Partitioning Social Networks for Fast Retrieval of Time-dependent Queries (Invited Paper)
Will Graph Data Management Techniques Contribute to the Successful Large-Scale Deployment of Semantic Web Technologies?
Graph-based Knowledge Representation: Computational Foundations of Conceptual Graphs (Advanced Information and Knowledge Processing) [Paperback]

2012年1月11日水曜日

NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce

NIMBLE: A Toolkit for the Implementation of Parallel Data
Mining and Machine Learning Algorithms on MapReduce
http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p334.pdf


In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and
data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the
Message Passing Interface (MPI) have been the de facto choices
for implementing these ML-DM algorithms. The MR programming paradigm has been of particular interest as it gracefully handles large datasets and has built-in resilience against failures. However, the existing parallel programming paradigms are too low-level
and ill-suited for implementing ML-DM algorithms. To address
this deficiency, we present NIMBLE, a portable infrastructure that
has been specifically designed to enable the rapid implementation
of parallel ML-DM algorithms. The infrastructure allows one to
compose parallel ML-DM algorithms using reusable (serial and
parallel) building blocks that can be efficiently executed using MR
and other parallel programming models; it currently runs on top of
Hadoop, which is an open-source MR implementation. We show
how NIMBLE can be used to realize scalable implementations of
ML-DM algorithms and present a performance evaluation

2012年1月7日土曜日

Wikipedia SOM (Self Organizing Map)

Wikipedia SOM

https://kaigi.org/jsai/webprogram/2011/pdf/329.pdf


Web ブラウザを経由して誰でも編集可能なオンライン百科
事典「Wikipedia」は,半構造化されたデータ構造を持ち,幅
広い分野に高い網羅性を持つなどの特徴を持つことから,人
工知能,自然言語処理,Web マイニングをはじめとする各種
の研究分野で,コーパスとして活用されてきた.Wikipedia 上
に公開される情報は日々増加しており,全ての言語を合計する
と,1,800 万以上の記事が存在する.この結果,どの分野にど
の程度の情報が存在し,分野同士がどうつながっているのか,
といったような Wikipedia の全体像を把握することが困難に
なっている.Wikipedia をコーパスとして利用する研究にお
いては,データの特性に応じてアルゴリズムを設計することが
多いため,どのような記事集合がどれほどあり,どのようなク
ラスタがあるのか,クラスタ間の関係はどうなっているのかな
ど,全体を俯瞰することが重要である.また,Wikipedia の閲
覧や編集など一般ユーザとして関わる場合にも,全体を俯瞰す
ることは,不足している情報を把握することや分野同士の関係
性を調べるといった用途において重要であると考えられる.
本研究では,神経細胞移動に着想を得た自己組織化マップ
アルゴリズムの「MIGSOM」[Nakayama 11] をWikipedia に
適用し,全体情報を俯瞰する方法を提案する.MIGSOM は,
大規模な疎データを可視化し,文書マップを作成する技術であ
る.MIGSOM には二つの特徴がある.一つは大規模なデータ
に適用した時にも安定したクラスタリング性能が期待できる点
である.もう一方の特徴は,ズーム機能を利用したクラスタ解
析が可能な点である.これにより,大局的なクラスタと局所的
なクラスタの解析が可能になっ

Graph + Bioinformatics

Causal graph-based analysis of genome-wide association data in rheumatoid arthritis
http://www.biology-direct.com/content/6/1/25

Graph-based methods for analysing networks in cell biology
http://bib.oxfordjournals.org/content/7/3/243.full
Availability of large-scale experimental data for cell biology is enabling computational methods to systematically model the behaviour of cellular networks. This review surveys the recent advances in the field of graph-driven methods for analysing complex cellular networks. The methods are outlined on three levels of increasing complexity, ranging from methods that can characterize global or local structural properties of networks to methods that can detect groups of interconnected nodes, called motifs or clusters, potentially involved in common elementary biological functions. We also briefly summarize recent approaches to data integration and network inference through graph-based formalisms. Finally, we highlight some challenges in the field and offer our personal view of the key future trends and developments in graph-based analysis of large-scale datasets.

Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2912890/

The investigation of plant genome structure and evolution requires comprehensive characterization of repetitive sequences that make up the majority of higher plant nuclear DNA. Since genome-wide characterization of repetitive elements is complicated by their high abundance and diversity, novel approaches based on massively-parallel sequencing are being adapted to facilitate the analysis. It has recently been demonstrated that the low-pass genome sequencing provided by a single 454 sequencing reaction is sufficient to capture information about all major repeat families, thus providing the opportunity for efficient repeat investigation in a wide range of species. However, the development of appropriate data mining tools is required in order to fully utilize this sequencing data for repeat characterization.
Results
We adapted a graph-based approach for similarity-based partitioning of whole genome 454 sequence reads in order to build clusters made of the reads derived from individual repeat families. The information about cluster sizes was utilized for assessing the proportion and composition of repeats in the genomes of two model species, Pisum sativum and Glycine max, differing in genome size and 454 sequencing coverage. Moreover, statistical analysis and visual inspection of the topology of the cluster graphs using a newly developed program tool, SeqGrapheR, were shown to be helpful in distinguishing basic types of repeats and investigating sequence variability within repeat families.
Conclusions
Repetitive regions of plant genomes can be efficiently characterized by the presented graph-based analysis and the graph representation of repeats can be further used to assess the variability and evolutionary divergence of repeat families, discover and characterize novel elements, and aid in subsequent assembly of their consensus sequences.

Graph-based data mining for biological applications
https://lirias.kuleuven.be/bitstream/123456789/267094/1/phd_leander_schietgat_final_version.pdf

Graph + Medical Informatics

Graph-based signal integration for patient diagnosis discovery
http://www.biotconf.org/biot2011/BIOT-2011-EA-Herskovic-Bernstam.pdf

Biomedical information problems are often difficult to
solve with computational techniques. One reason for this
difficulty is that many qualitatively different information
types must be combined to solve the problem. For example,
diagnosing a disease requires that information such as
patient demographics, past medical history, disease
prevalence, and current complaints be integrated in some
rational way. In this paper, we describe an approach that
leverages graphs to integrate several information sources to
“diagnose” a patient record

A three-level graph-based model for the management of hospital information systems.
http://www.ncbi.nlm.nih.gov/pubmed/7476470
Information processing in hospitals, especially in university hospitals, is currently faced with two major issues: low-cost hardware and progress in networking technology leads to a further decentralization of computing capacity, due to the increasing need for information processing in hospitals and due to economic restrictions, it is necessary to use commercial software products. This leads to heterogeneous hospital information systems using a variety of software and hardware products, and to a stronger demand for integrating these products and, in general, for a dedicated methodology for the management of hospital information systems to support patient care and medical research. We present a three-level graph-based model (3LGM) to support the systematic management of hospital information systems. 3LGM can serve as a basis for assessing the quality of information processing in hospitals. 3LGM distinguishes between a procedural level for describing the information procedures (and their information interchange) of a hospital information system and thus its functionality, a logical too level, focusing on application systems and communication links, and a physical tool level with physical subsystems (e.g., computer systems) and data transmission. The examples that are presented have been taken from the Heidelberg University Hospital Information System.

Integrated multimedia electronic patient record and grap
http://perso.telecom-paristech.fr/~bloch/papers/CBM-08.pdf
Current electronic patient record (EPR) implementations do not incorporate medical images, nor structural information extracted from them,
despite images increasing role for diagnosis. This paper presents an integration framework into EPRs of anatomical and pathological knowledge
extracted from segmented magnetic resonance imaging (MRI), applying a graph of representation for anatomical and functional information
for individual patients. Focusing on cerebral tumors examination and patient follow-up, multimedia EPRs were created and evaluated through a
3D navigation application, developed with open-source libraries and standards. Results suggest that the enhanced clinical information scheme
could lead to original changes in the way medical experts utilize image-based information.


A Graph-based Bagging
http://www.cse.iitm.ac.in/CoLISD/Papers/4.pdf
The ensemble technique weights several individual classi ers
and combines them to obtain a composite classi er that outperforms each
of them alone. Despite of this technique has been successfully applied in
many areas, in the literature review we did not nd its application on
networked data. To tackle with this lack, we propose a bagging proce-
dure applied to graph-based classi ers. Our contribution was to develop
a bagging procedure for networked data which either maintains or signif-
icantly improves the performance of the tested relational classi ers. Ad-
ditionally, we investigate the role played by diversity among the several
individual classi ers to explain when the technique can be successfully
applied.

Graph-based learning for information systems
http://gradworks.umi.com/33/52/3352368.html


Design and Evaluation of a Temporal, Graph-based Language for Querying Collections of Patient Histories

Giving clinicians and researchers the ability to easily retrieve and explore relevant fragments of patient histories would greatly facilitate quality assurance, patient follow-up and research on patient treatment processes. Established database query languages are inconvenient for such exploration, and may also be too complex for users with limited backgrounds in informatics. We believe that understandability can be increased in return for a sacrifice of some of the power of expression found in general query languages. In order to design a specialized query language, we have collected and synthesized a tentative list of requirements. Based on these requirements, we have designed and implemented Practice Explorer, a prototype for visual query of collections of patient histories, and evaluated the understandability of its query language by testing with medical students. The results indicate that parts of the language are intuitive enough for users to understand without demonstrations, examples, feedback or assistance. They also provide some lessons for future work in this area.


Graph-based Word Sense Disambiguation of Biomedical Documents
http://bioinformatics.oxfordjournals.org/content/early/2010/10/07/bioinformatics.btq555.full.pdf
Word Sense Disambiguation (WSD), automatically identifying the meaning of ambiguous words in context, is an important stage of text processing. This paper presents a graph-based
approach to WSD in the biomedical domain. The method is unsupervised and does not require any labeled training data. It makes use
of knowledge from the Unified Medical Language System (UMLS)
Metathesaurus which is represented as a graph. A state-of-the-art
algorithm, Personalized PageRank, is used to perform WSD.
Results: When evaluated on the NLM-WSD dataset the algorithm
outperforms other methods that rely on the UMLS Metathesaurus
alone


Graph-Based Shape Analysis for MRI Classification
http://www.igi-global.com/article/international-journal-knowledge-discovery-bioinformatics/62299
Searching for correlations between brain structure and attributes of a person’s intellectual state is a process which may be better done by automation than by human labor. Such an automated system would be capable of performing classification based on the discovered correlation, which would be means of testing how accurate the discovered correlation is. The authors have developed a system which generates a graph-based representation of the shape of the third and lateral ventricles based on a structural MRI, and classifies images represented in this manner. The system is evaluated on accuracy at classifying individuals showing cognitive impairment to Alzheimer’s Disease. Classification accuracy is 74.2% when individuals with CDR 0.5 are included as impaired in a balanced dataset of 166 images, and 79.3% accuracy when differentiating individuals with CDR at least 1.0 and healthy individuals in a balanced dataset of 54 images. Finally, the system is used to classify MR images according to level of education, with 77.2% accuracy differentiating highly-educated individuals from those for whom no higher education is listed, in a balanced dataset of 178 images.


Graph Based Detection of Optic Disc and Fovea in Retinal Images
http://www.inf.unideb.hu/~hajdua/papers/hajduc_2010h.pdf

Diabetic retinopathy (DR) is the damage to the eye's
retina that occurs with long-term diabetes, which can eventually
lead to blindness. Screening programs for DR are being
introduced, however, an important prerequisite for automation is
the accurate localization of the main anatomical features in the
image, notably the optic disc (OD) and the macula. A series of
interesting algorithms have been proposed in the recent past and
the performance is generally good, but each method has
situations, where it fails. This paper presents a novel framework
for automatic detection of optic disc and macula in retinal fundus
images using a combination of different optic disc and macula
detectors represented by a weighted complete graph. A node
pruning procedure removes the worst vertices of the graph by
satisfying predefined geometric constraints and get best possible
detector outputs to be combined using a weighted average. The
extensive tests have shown that combining the predictions of
multiple detectors is more accurate than any of the individual
detectors making up the ensemble.

Power Watershed: A Unifying Graph-Based Optimization Framework
http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2010.200

In this work, we extend a common framework for graph-based image segmentation that includes the graph cuts, random walker, and shortest path optimization algorithms. Viewing an image as a weighted graph, these algorithms can be expressed by means of a common energy function with differing choices of a parameter q acting as an exponent on the differences between neighboring nodes. Introducing a new parameter p that fixes a power for the edge weights allows us to also include the optimal spanning forest algorithm for watershed in this same framework. We then propose a new family of segmentation algorithms that fixes p to produce an optimal spanning forest but varies the power q beyond the usual watershed algorithm, which we term the power watershed. In particular, when q=2, the power watershed leads to a multilabel, scale and contrast invariant, unique global optimum obtained in practice in quasi-linear time. Placing the watershed algorithm in this energy minimization framework also opens new possibilities for using unary terms in traditional watershed segmentation and using watershed to optimize more general models of use in applications beyond image segmentation.


MEDRank: using graph-based concept ranking to index biomedical texts.
http://www.mendeley.com/research/medrank-using-graphbased-concept-ranking-index-biomedical-texts-1/
As the volume of biomedical text increases exponentially, automatic indexing becomes increasingly important. However, existing approaches do not distinguish central (or core) concepts from concepts that were mentioned in passing. We focus on the problem of indexing MEDLINE records, a process that is currently performed by highly trained humans at the National Library of Medicine (NLM). NLM indexers are assisted by a system called the Medical Text Indexer (MTI) that suggests candidate indexing terms.


A semantic graph-based approach to biomedical summarisation
http://dl.acm.org/citation.cfm?id=2016307
Access to the vast body of research literature that is available in biomedicine and related fields may be improved by automatic summarisation. This paper presents a method for summarising biomedical scientific literature that takes into consideration the characteristics of the domain and the type of documents. Methods: To address the problem of identifying salient sentences in biomedical texts, concepts and relations derived from the Unified Medical Language System (UMLS) are arranged to construct a semantic graph that represents the document. A degree-based clustering algorithm is then used to identify different themes or topics within the text. Different heuristics for sentence selection, intended to generate different types of summaries, are tested. A real document case is drawn up to illustrate how the method works. Results: A large-scale evaluation is performed using the recall-oriented understudy for gisting-evaluation (ROUGE) metrics. The results are compared with those achieved by three well-known summarisers (two research prototypes and a commercial application) and two baselines. Our method significantly outperforms all summarisers and baselines. The best of our heuristics achieves an improvement in performance of almost 7.7 percentage units in the ROUGE-1 score over the LexRank summariser (0.7862 versus 0.7302). A qualitative analysis of the summaries also shows that our method succeeds in identifying sentences that cover the main topic of the document and also considers other secondary or ''satellite'' information that might be relevant to the user. Conclusion: The method proposed is proved to be an efficient approach to biomedical literature summarisation, which confirms that the use of concepts rather than terms can be very useful in automatic summarisation, especially when dealing with highly specialised domains.


グラフ構造を用いた時系列関係の発見(9月15日)(<特集>「アクティブマイニング」及び一般)
http://ci.nii.ac.jp/naid/110002664364
医療データを取り扱った知識発見で,最も難しい部分は,不均質に発生する時系列データの扱い方である.本論文では,そのようなデータを取り扱うための,一階述語論理を使った規則発見手法を提案する.提案手法は,時系列データの表現にグラフを採り入れて,そのデータをある規則にしたがって書き換えることで,規則に使われる述語の候補の生成を行う.評価のために,実際の医療データを用いて実験を行い,手法の有効性を示した.

PuReD-MCL: a graph-based PubMed document clustering methodology
http://bioinformatics.oxfordjournals.org/content/24/17/1935.full


A graph-based representation of Gene Expression profiles in DNA microarrays
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4675762
This paper proposes a new and very flexible data model, called gene expression graph (GEG), for genes expression analysis and classification. Three features differentiate GEGs from other available microarray data representation structures: (i) the memory occupation of a GEG is independent of the number of samples used to built it; (ii) a GEG more clearly expresses relationships among expressed and non expressed genes in both healthy and diseased tissues experiments; (iii) GEGs allow to easily implement very efficient classifiers. The paper also presents a simple classifier for sample-based classification to show the flexibility and user-friendliness of the proposed data structure.

GOLEM: an interactive graph-based gene-ontology navigation and analysis tool
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1618863/
The Gene Ontology has become an extremely useful tool for the analysis of genomic data and structuring of biological knowledge. Several excellent software tools for navigating the gene ontology have been developed. However, no existing system provides an interactively expandable graph-based view of the gene ontology hierarchy. Furthermore, most existing tools are web-based or require an Internet connection, will not load local annotations files, and provide either analysis or visualization functionality, but not both.



http://www.biomedcentral.com/1471-2105/8/S9/S4/

Graph + Text Analysis

Plagiarism Detection Using Graph-Based Representation
http://www.mendeley.com/research/plagiarism-detection-using-graphbased-representation/

Plagiarism of material from the Internet is a widespread and growing problem. Several methods used to detect the plagiarism and similarity between the source document and suspected documents such as fingerprint based on character or n-gram. In this paper, we discussed a new method to detect the plagiarism based on graph representation; however, Preprocessing for each document is required such as breaking down the document into its constituent sentences. Segmentation of each sentence into separated terms and stop word removal. We build the graph by grouping each sentence terms in one node, the resulted nodes are connected to each other based on order of sentence within the document, all nodes in graph are also connected to top level node "Topic Signature". Topic signature node is formed by extracting the concepts of each sentence terms and grouping them in such node. The main advantage of the proposed method is the topic signature which is main entry for the graph is used as quick guide to the relevant nodes. which should be considered for the comparison between source documents and suspected one. We believe the proposed method can achieve a good performance in terms of effectiveness and efficiency.



テキスト分析における2部グラフクラスタリングの可能性(テキストの類似性・文処理モデル)
Possibilities of the Bipartite Graph Clustering in Text Analysis

http://ci.nii.ac.jp/naid/110004809727
テキスト解析にマルコフクラスタリング(MCL)、およびそれを独自に改良したリカレント・マルコフクラスタリング(RMCL)を利用する場合、有効なデータ取得法として、キーワードと共起語のペアに基づく、2部グラフ化の方法を提案し、MCL-RMCLによる2部グラフクラスタリングの計算結果を、従来のベクトル空間モデルに基づく多変量解析と比較、有効性を検証する.

Graph + CyberSecurity (2)

EigenDiagnostics: Spotting Connection Patterns and Outliers in Large Graphs
http://www.computer.org/portal/web/csdl/doi/10.1109/ICDMW.2010.203

In a large weighted graph, how can we detect suspicious sub graphs, patterns, and outliers? A suspicious pattern could be a near-clique or a set of nodes bridging two or more near-cliques. This would improve intrusion detection in computer networks and network traffic monitoring. Are there other network patterns that need to be detected? We propose EigenDiagnostics, a fast algorithm that spots such patterns. The process creates scatter-plots of the node properties (such as eigenscores, degree, and weighted degree), then looks for linear-like patterns. Our tool automatically discovers such plots, using the Hough transform from machine vision. We apply EigenDiagnostics on a wide variety of synthetic and real data (LBNL computer traffic, movie-actor data from IMDB, Patent citations, and more). EigenDiagnostics finds surprising patterns. They appear to correspond to port-scanning (in computer networks), repetitive tasks with bot-net-like behavior, strange gbridgesh in movie-actor data (due to actors changing careers, for example), and more. The advantages are: (a) it is effective in discovering surprising patterns. (b) it is fast (linear on the number of edges) (c) it is parameter-free, and (d) it is general, and applicable to many, diverse graphs, spanning tens of GigaBytes.


Mining and Modeling Real Graphs: Patterns, Generators, Anomalies, and Tools
http://www.cs.cmu.edu/~lakoglu/proposal/lakoglu-proposal.pdf




Community-based anomaly detection in evolutionary networks
http://www.springerlink.com/content/b61165511117u863/
Networks of dynamic systems, including social networks, the World Wide Web, climate networks, and biological networks, can be highly clustered. Detecting clusters, or communities, in such dynamic networks is an emerging area of research; however, less work has been done in terms of detecting community-based anomalies. While there has been some previous work on detecting anomalies in graph-based data, none of these anomaly detection approaches have considered an important property of evolutionary networks—their community structure. In this work, we present an approach to uncover community-based anomalies in evolutionary networks characterized by overlapping communities. We develop a parameter-free and scalable algorithm using a proposed representative-based technique to detect all six possible types of community-based anomalies: grown, shrunken, merged, split, born, and vanished communities. We detail the underlying theory required to guarantee the correctness of the algorithm. We measure the performance of the community-based anomaly detection algorithm by comparison to a non–representative-based algorithm on synthetic networks, and our experiments on synthetic datasets show that our algorithm achieves a runtime speedup of 11–46 over the baseline algorithm. We have also applied our algorithm to two real-world evolutionary networks, Food Web and Enron Email. Significant and informative community-based anomaly dynamics have been detected in both cases.



Using Bayesian Networks for Cyber Security Analysis
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5544924
Capturing the uncertain aspects in cyber security is important for security analysis in enterprise networks. However, there has been insufficient effort in studying what modeling approaches correctly capture such uncertainty, and how to construct the models to make them useful in practice. In this paper, we present our work on justifying uncertainty modeling for cyber security, and initial evidence indicating that it is a useful approach. Our work is centered around near real-time security analysis such as intrusion response. We need to know what is really happening, the scope and severity level, possible consequences, and potential countermeasures. We report our current efforts on identifying the important types of uncertainty and on using Bayesian networks to capture them for enhanced security analysis. We build an example Bayesian network based on a current security graph model, justify our modeling approach through attack semantics and experimental study, and show that the resulting Bayesian network is not sensitive to parameter perturbation.


Applying Graph-Based Anomaly Detection Approaches to the Discovery of Insider Threats
http://eecs.wsu.edu/~holder/pubs/EberleISI09.pdf

The ability to mine data represented as a graph has
become important in several domains for detecting various
structural patterns. One important area of data mining is
anomaly detection, but little work has been done in terms of
detecting anomalies in graph-based data. In this paper we
present graph-based approaches to uncovering anomalies in
applications containing information representing possible insider
threat activity: e-mail, cell-phone calls, and order processing.




Graph-based malware detection using dynamic analysis
http://www.mendeley.com/research/graphbased-malware-detection-using-dynamic-analysis/

Visualizing graph dynamics and similarity for enterprise network security and management
Managing complex enterprise networks requires an understanding at a finer granularity than traditional network monitoring. The ability to correlate and visualize the dynamics and inter-relationships among various network components such as hosts, users, and applications is non-trivial. In this paper, we propose a visualization approach based on the hierarchical structure of similarity/difference visualization in the context of heterogeneous graphs. The concept of hierarchical visualization starts with the evolution of inter-graph states, adapts to the visualization of intra-graph clustering, and concludes with the visualization of similarity between individual nodes. Our visualization tool, ENAVis (Enterprise Network Activities Visualization), quantifies and presents these important changes and dynamics essential to network operators through a visually appealing and highly interactive manner. Through novel graph construction and transformation, such as network connectivity graphs, MDS graphs, bipartite graphs, and similarity graphs, we demonstrate how similarity/dynamics can be effectively visualized to provide insight with regards to network understanding.



A Graph Similarity-based Approach to Security
Event Analysis Using Correlation Techniques

http://dl.acm.org/citation.cfm?id=1850799
—Detecting and identifying security events to provide
cyber situation awareness has become an increasingly important
task within the network research and development community.
We propose a graph similarity-based approach to event detection
and identification that integrates a number of techniques to
collect time-varying situation information, extract correlations
between event attributes, and characterize and identify security
events. Diverging from the traditional rule- or statistical-based
pattern matching techniques, the proposed mechanism represents
security events in a graphical form of correlation networks and
identifies security events through the computation of graph similarity measurements to eliminate the need for constructing user
or system profiles. These technical components take fundamentally different approaches from traditional empirical or statistical
methods and are designed based on rigorous computational
analysis with mathematically proven performance guarantee.
The performance superiority of the proposed mechanism is
demonstrated by extensive simulation and experimental results

Graph Based Statistical Analysis of Network Traffic
http://www.cs.purdue.edu/mlg2011/papers/paper_10.pdf

We propose a method for analyzing tra c data in large com-
puter networks such as big enterprise networks or the In-
ternet. Our approach combines graph theoretical represen-
tation of the data and graph analysis with novel statistical
methods for discovering pattern and time-related anomalies.
We model the tra c as a graph and use temporal charac-
teristics of the data in order to decompose it into subgraphs
corresponding to individual sessions, whose characteristics
are then analyzed using statistical methods. The goal of
that analysis is to discover patterns in the network tra c
data that might indicate intrusion activity or other mali-
cious behavior


Using Graph Theory to Detect Security Policy Violators
http://www.mawode.com/~waltman/talks/plug05emailgraph.pdf

Understanding Multistage Attacks by Attack-Track based Visualization of Heterogeneous Event Streams
http://www.cse.buffalo.edu/~shambhu/documents/pdf/vizsec01s-mathew.pdf

In this paper, we present a method of handling the visualization of hetereogeneous event traffic that is generated by
intrusion detection sensors, log files and other event sources
on a computer network from the point of view of detecting
multistage attack paths that are of importance. We perform
aggregation and correlation of these events based on their semantic content to generate Attack Tracks that are displayed
to the analyst in real-time. Our tool, called the Event Correlation for Cyber-Attack Recognition System (ECCARS) enables the analyst to distinguish and separate an
evolving multistage attack from the thousands of events generated on a network. We focus here on presenting the environment and framework for multistage attack detection
using ECCARS along with screenshots that demonstrate its
capabilities.
Categories an

Scenario Graphs Applied to Network Security
http://www.cs.cmu.edu/~scenariograph/wing07.pdf

Traditional model checking produces one counterexample to illustrate a violation of a property by a
model of the system. Some applications benefit from having all counterexamples, not just one. We call this set of
counterexamples a scenario graph. In this chapter we present two different algorithms for producing scenario graphs
and explain how scenario graphs are a natural representation for attack graphs used in the security community.
Through a detailed concrete example, we show how we can model a computer network and generate and analyze
attack graphs automatically. The attack graph we produce for a network model shows all ways in which an intruder
can violate a given desired security property

Network Security Evaluation through Attack Graph Generation
http://www.waset.org/journals/waset/v54/v54-73.pdf
In today’s network, security evaluation is a challenging
task for most of the administrators. The typical means by which an
attacker breaks into a network is through a series of exploits, where
each exploit in the series satisfies the precondition for subsequent
exploits and makes a causal relationship among them. Such a series of
exploits constitutes an attack path and the set of all possible attack
paths form an attack graph. Even the well administered networks are
susceptible to such attacks as present day vulnerability scanners are
only able to identify the vulnerabilities in isolation but there is a need
for logical formalism and correlation among these vulnerabilities
within a host or across multiple hosts to identify overall risk of the
network. In this paper we propose a network security analysis method
by the generation of network attack graph. After analyzing network
vulnerabilities, linking relation between devices and the characteristic
of attack, the model of network security states is built, and the
generating algorithm of attack graph is implemented. Attack graphs
are important tools for analyzing security vulnerabilities in enterprise
networks. The experiment validates the evaluation method we
proposed.

Graph + CyberSecurity

Making Cyber Security Decisions Through a. Quantitative Metrics Approach
http://eden.dei.uc.pt/~mvieira/raci2011/Sanders_RACI_keynote.pdf


Least Effort Strategies for Cybersecurit
http://arxiv.org/pdf/cond-mat/0306002

Cybersecurity is an issue of increasing concern since the events of September 11
Many questions have been raised concerning the security of the Internet and the rest of US’s
information infrastructure. This paper begins to examine the issue by analyzing the Internet’s
autonomous system (AS) map. Using the AS map, malicious infections are simulated and
different defense strategies are considered in a cost benefit framework. The results show that
protecting the most connected nodes provides significant gains in security and that after the small
minority of the most connected nodes are protected there are diminishing returns for further
protection. Although if parts of the small minority of the most connected firm are not protected,
such as non-US firms, protection levels are significantly decrea

Graph-based Data Mining

Graph-based data mining
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=850825
Using databases represented as graphs, the Subdue system performs two key data mining techniques: unsupervised pattern discovery and supervised concept learning from examples. Applications to large structural databases demonstrate Subdue's scalability and effectiveness

CyberSecurity + Graph

Cyber Security Risks Assessment with Bayesian Defense Graphs and Architectural Models
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4755419

To facilitate rational decision making regarding cyber security investments, decision makers need to be able to assess expected losses before and after potential investments. This paper presents a model based assessment framework for analyzing the cyber security provided by different architectural scenarios. The framework uses the Bayesian statistics based extended influence diagrams to express attack graphs and related countermeasures. In this paper it is demonstrated how this structure can be captured in an abstract model to support analysis based on architectural models. The approach allows calculating the probability that attacks will succeed and the expected loss of these given the instantiated architectural scenario. Moreover, the framework can handle the uncertainties that are accompanied to the analyses. In architectural analysis there are uncertainties acquainted both to the scenario and its properties, as well as to the analysis framework that stipulates how security countermeasures contribute to cyber security.

entrifuge 2.0 - Cyber Security Analysis - Identify Insights with Relationship Graphs
http://www.youtube.com/watch?v=8kJk0mBh8sg
http://www.youtube.com/watch?v=eZqxRGBwIO0&feature=related


Attack Graphs for Proactive Digital Forensics



A Mathematical Basis for Science-Based Cybersecurity

Mining Graph Patterns Efficiently via Randomized Summaries
http://www.cs.uiuc.edu/~hanj/pdf/vldb09_cchen.pdf

Graphs are prevalent in many domains such as Bioinformatics, social networks, Web and cyber-security. Graph pattern
mining has become an important tool in the management
and analysis of complexly structured data, where example
applications include indexing, clustering and classification.
Existing graph mining algorithms have achieved great success by exploiting various properties in the pattern space.
Unfortunately, due to the fundamental role subgraph isomorphism plays in these methods, they may all enter into a
pitfall when the cost to enumerate a huge set of isomorphic
embeddings blows up, especially in large graphs.
The solution we propose for this problem resorts to reduction on the data space. For each graph, we build a summary of it and mine this shrunk graph instead. Compared
to other data reduction techniques that either reduce the
number of transactions or compress between transactions,
this new framework, called Summarize-Mine, suggests a
third path by compressing within transactions. SummarizeMine is effective in cutting down the size of graphs, thus
decreasing the embedding enumeration cost. However, compression might lose patterns at the same time. We address
this issue by generating randomized summaries and repeating the process for multiple rounds, where the main idea is
that true patterns are unlikely to miss from all rounds. We
provide strict probabilistic guarantees on pattern loss likelihood. Experiments on real malware trace data show that
Summarize-Mine is very efficient, which can find interesting malware fingerprints that were not revealed previously


Cyber Security Link Analysis Graph
http://www.analyticbridge.com/photo/cyber-security-link-analysis
This visualization shows interesting patterns of behavior for recent network login traffic. The linkages are between source and destination IPs. The circular stars show one-to-one relationships representing normal behavior. But the unusual pattern in the lower central shows a destintation IP under attack -- it has over 100 hundred source IPs sending it traffic.


Oil Companies Stepping up Cyber Security as Hacking Attacks Increase
http://oilprice.com/Energy/Energy-General/Oil-Companies-Stepping-up-Cyber-Security-as-Hacking-Attacks-Increase.html


An Attack Graph Based Approach for Threat Identification of an Enterprise Network
http://www.igi-global.com/chapter/cyber-security-global-information-assurance/7409

The science of cyber security experimentation: the DETER project
http://dl.acm.org/citation.cfm?id=2076752
Since 2004, the DETER Cyber-security Project has worked to create an evolving infrastructure - facilities, tools, and processes - to provide a national resource for experimentation in cyber security. Building on our insights into requirements for cyber science and on lessons learned through 8 years of operation, we have made several transformative advances towards creating the next generation of DeterLab. These advances in experiment design and research methodology are yielding progressive improvements not only in experiment scale, complexity, diversity, and repeatability, but also in the ability of researchers to leverage prior experimental efforts of other researchers in the DeterLab user community. This paper describes the advances resulting in a new experimentation science and a transformed facility for cybersecurity research development and evaluation.


Blog Data Mining for Cyber Security Threats
http://ewinarko.staff.ugm.ac.id/datamining/tugas2/09-app-blogmining-cybersec.pdf
Blog data mining is a growing research area that addresses the domainspecific problem of extracting information from blog data. In our work, we analyzed
blogs for various categories of cyber threats related to the detection of security
threats and cyber crime. We have extended the Author-Topic model based on Latent Dirichlet Allocation for identify patterns of similarities in keywords and dates
distributed across blog documents. From this model, we visualized the content and
date similarities using the Isomap dimensionality reduction technique. Our findings
support the theory that our probabilistic blog model can present the blogosphere in
terms of topics with measurable keywords, hence aiding the investigative processes
to understand and respond to critical cyber security events and threats.


Insider Threat Detection Using Graph-Based Approach
http://eecs.wsu.edu/~holder/pubs/EberleCATCH09.pdf

Protecting our nation’s cyber infrastructure and
securing sensitive information are critical challenges
for homeland security and require the research,
development and deployment of new technologies
that can be transitioned into the field for combating
cyber security risks. Particular areas of concern are
the deliberate and intended actions associated with
malicious exploitation, theft or destruction of data, or
the compromise of networks, communications or
other IT resources, of which the most harmful and
difficult to detect threats are those propagated by an
insider. However, current efforts to identify
unauthorized access to information, such as what is
found in document control and management systems,
are limited in scope and capabilities.
In order to address this issue, this effort involves
performing further research and development on the
existing Graph-Based Anomaly Detection (GBAD)
system [3]. GBAD discovers anomalous instances of
structural patterns in data that represent entities,
relationships and actions. Input to GBAD is a labeled
graph in which entities are represented by labeled
vertices and relationships or actions are represented
by labeled edges between entities. Using the
minimum description length (MDL) principle to
identify the normative pattern that minimizes the
number of bits needed to describe the input graph
after being compressed by the pattern, GBAD
implements algorithms for identifying the three
possible changes to a graph: modifications,
insertions and deletions. Each algorithm discovers
those substructures that match the closest to the
normative pattern without matching exactly. As a
result, GBAD is looking for those activities that
appear to match normal (or legitimate) transactions,
but in fact are structurally different.
As a solution to the problem of insider threat
detection, we will apply GBAD to datasets that
represent the flow of information between entities, as
well as the actions that take place on the information.
This research involves the representation of datasets,
like a document control and management system, as a
graph, enhancement of GBAD’s performance levels,
and evaluation of GBAD on these datasets. In
previous research, GBAD has already achieved over
95% accuracy detecting anomalies in simulated
domains, with minimal false positives, on graphs of
up to 100,000 vertices.

Twitter + Lifelog

A location predictor based on dependencies between multiple lifelog data
http://dl.acm.org/citation.cfm?id=1867702


Towards trajectory-based experience sharing in a city
http://dl.acm.org/citation.cfm?id=2063221

Twitter as Economic Indicators

Twitter as Economic Indicators

http://dealmakersguide.com/twitter-as-an-economic-indicator/
http://professional.wsj.com/article/SB10001424052970204138204576598942105167646.html


Modeling public mood and emotion:
Twitter sentiment and socio-economic phenomena
http://arxiv.org/pdf/0911.1583

Twitter

Twitter under the microscope - First Monday

http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2317/206
Scholars, advertisers and political activists see massive online social networks as a representation of social interactions that can be used to study the propagation of ideas, social bond dynamics and viral marketing, among others. But the linked structures of social networks do not reveal actual interactions among people. Scarcity of attention and the daily rhythms of life and work makes people default to interacting with those few that matter and that reciprocate their attention. A study of social interactions within Twitter reveals that the driver of usage is a sparse and hidden network of connections underlying the “declared” set of friends and followers.3

Twitter + Influencer (Retweet)

Want to be Retweeted? Large Scale Analytics on
Factors Impacting Retweet in Twitter Network
http://www2.parc.com/isl/members/hong/publications/socialcomputing2010.pdf

Retweeting is the key mechanism for information
diffusion in Twitter. It emerged as a simple yet powerful way of
disseminating useful information. Even though a lot of
information is shared via its social network structure in Twitter,
little is known yet about how and why certain information
spreads more widely than others. In this paper, we examine a
number of features that might affect retweetability of tweets. We
gathered content and contextual features from 74M tweets and
used this data set to identify factors that are significantly
associated with retweet rate. We also built a predictive retweet
model. We found that, amongst content features, URLs and
hashtags have strong relationships with retweetability. Amongst
contextual features, the number of followers and followees as well
as the age of the account seem to affect retweetability, while,
interestingly, the number of past tweets does not predict
retweetability of a user’s tweet. We believe that this research
would inform the design of sensemaking tools for

Identifying Influencers on Twitter
http://thenoisychannel.com/2011/04/16/identifying-influencers-on-twitter/

More Twitter Analysis: Influencers Don't Retweet
http://www.readwriteweb.com/archives/more_twitter_analysis_influencers_dont_retweet.php

Measuring User Influence in Twitter: The Million Follower Fallacy
http://an.kaist.ac.kr/~mycha/docs/icwsm2010_cha.pdf


Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter
http://research.microsoft.com/pubs/102168/TweetTweetRetweet.pdf