
Thursday, April 12, 2012

CREST FY2011 (H23) RA Results Presentation

We held the results presentation for the CREST RAs for fiscal 2011. Five RAs implemented five systems: pattern matching, spectral clustering, Random Walk with Restart, collaborative filtering, and a stream processing engine. All produced good results despite the short time frame. Three of the RAs will continue this year. Beyond our small cluster, this year we aim to run large-scale experiments on TSUBAME and carry out the optimizations needed for them.

Saturday, January 7, 2012

Wikipedia SOM (Self Organizing Map)

Wikipedia SOM

https://kaigi.org/jsai/webprogram/2011/pdf/329.pdf


The online encyclopedia Wikipedia, editable by anyone through a web browser, has a semi-structured data model and broad, highly comprehensive coverage, and has therefore been used as a corpus in many research fields, including artificial intelligence, natural language processing, and web mining. The information published on Wikipedia grows daily; summed over all languages, there are more than 18 million articles. As a result, it has become difficult to grasp the overall picture of Wikipedia: how much information exists in which fields, and how fields are connected. Research that uses Wikipedia as a corpus often designs algorithms around the characteristics of the data, so a bird's-eye view of the whole — what article sets exist and how large they are, what clusters there are, and how clusters relate to each other — is important. A global overview is also useful for ordinary users who browse or edit Wikipedia, for tasks such as identifying missing information and exploring relationships between fields.

In this work, we apply MIGSOM [Nakayama 11], a self-organizing map algorithm inspired by neuronal migration, to Wikipedia and propose a method for surveying it as a whole. MIGSOM is a technique for visualizing large-scale sparse data and producing document maps. It has two distinguishing features: first, stable clustering performance can be expected even when applied to large-scale data; second, it supports cluster analysis with a zoom function, which enables analysis of both global and local clusters.
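MIGSOM itself is specific to the paper above; as a rough illustration of the general self-organizing-map idea it builds on, here is a minimal classic Kohonen SOM sketch in Python. The grid size, decay schedules, and toy data are my own illustrative assumptions, not the paper's setup:

```python
import math
import random

def train_som(data, grid_w=4, grid_h=4, epochs=50, lr0=0.5, sigma0=2.0, seed=0):
    """Classic Kohonen SOM: each grid cell holds a weight vector that is
    pulled toward input samples, with a neighborhood that shrinks over time."""
    rng = random.Random(seed)
    dim = len(data[0])
    # initialize grid weights randomly in [0, 1)^dim
    weights = {(i, j): [rng.random() for _ in range(dim)]
               for i in range(grid_w) for j in range(grid_h)}
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)              # learning rate decays linearly
        sigma = sigma0 * (1 - t / epochs) + 0.5  # neighborhood radius decays
        for x in data:
            # best-matching unit (BMU): cell whose weights are closest to x
            bmu = min(weights,
                      key=lambda c: sum((w - v) ** 2 for w, v in zip(weights[c], x)))
            for cell, w in weights.items():
                d2 = (cell[0] - bmu[0]) ** 2 + (cell[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2 * sigma * sigma))  # neighborhood kernel
                weights[cell] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return weights

def map_to_cell(weights, x):
    """Project a sample onto the trained grid (its BMU coordinates)."""
    return min(weights, key=lambda c: sum((w - v) ** 2 for w, v in zip(weights[c], x)))
```

After training, distinct regions of the input space land on distinct grid cells, which is what makes the zoomable map view possible.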

Graph + Bioinformatics

Causal graph-based analysis of genome-wide association data in rheumatoid arthritis
http://www.biology-direct.com/content/6/1/25

Graph-based methods for analysing networks in cell biology
http://bib.oxfordjournals.org/content/7/3/243.full
Availability of large-scale experimental data for cell biology is enabling computational methods to systematically model the behaviour of cellular networks. This review surveys the recent advances in the field of graph-driven methods for analysing complex cellular networks. The methods are outlined on three levels of increasing complexity, ranging from methods that can characterize global or local structural properties of networks to methods that can detect groups of interconnected nodes, called motifs or clusters, potentially involved in common elementary biological functions. We also briefly summarize recent approaches to data integration and network inference through graph-based formalisms. Finally, we highlight some challenges in the field and offer our personal view of the key future trends and developments in graph-based analysis of large-scale datasets.

Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2912890/

The investigation of plant genome structure and evolution requires comprehensive characterization of repetitive sequences that make up the majority of higher plant nuclear DNA. Since genome-wide characterization of repetitive elements is complicated by their high abundance and diversity, novel approaches based on massively-parallel sequencing are being adapted to facilitate the analysis. It has recently been demonstrated that the low-pass genome sequencing provided by a single 454 sequencing reaction is sufficient to capture information about all major repeat families, thus providing the opportunity for efficient repeat investigation in a wide range of species. However, the development of appropriate data mining tools is required in order to fully utilize this sequencing data for repeat characterization.
Results
We adapted a graph-based approach for similarity-based partitioning of whole genome 454 sequence reads in order to build clusters made of the reads derived from individual repeat families. The information about cluster sizes was utilized for assessing the proportion and composition of repeats in the genomes of two model species, Pisum sativum and Glycine max, differing in genome size and 454 sequencing coverage. Moreover, statistical analysis and visual inspection of the topology of the cluster graphs using a newly developed program tool, SeqGrapheR, were shown to be helpful in distinguishing basic types of repeats and investigating sequence variability within repeat families.
Conclusions
Repetitive regions of plant genomes can be efficiently characterized by the presented graph-based analysis and the graph representation of repeats can be further used to assess the variability and evolutionary divergence of repeat families, discover and characterize novel elements, and aid in subsequent assembly of their consensus sequences.
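As a rough sketch of the idea (not the paper's actual pipeline or its SeqGrapheR tool), reads can be linked by a similarity measure and repeat families read off as connected components of the resulting graph; the k-mer Jaccard measure, `k`, and threshold below are illustrative assumptions:

```python
from itertools import combinations

def kmers(seq, k=4):
    """All length-k substrings of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=4):
    """Jaccard similarity of k-mer sets — a cheap proxy for sequence similarity."""
    ka, kb = kmers(a, k), kmers(b, k)
    return len(ka & kb) / len(ka | kb) if ka | kb else 0.0

def cluster_reads(reads, threshold=0.2, k=4):
    """Build a similarity graph over reads and return its connected
    components; each component approximates one repeat family."""
    adj = {i: set() for i in range(len(reads))}
    for i, j in combinations(range(len(reads)), 2):
        if similarity(reads[i], reads[j], k) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    seen, clusters = set(), []
    for s in adj:
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:                      # iterative DFS over one component
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            stack.extend(adj[u] - seen)
        clusters.append(sorted(comp))
    return clusters
```

Cluster sizes then estimate the genomic proportion of each repeat family, as the abstract describes.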

Graph-based data mining for biological applications
https://lirias.kuleuven.be/bitstream/123456789/267094/1/phd_leander_schietgat_final_version.pdf

Graph + Medical Informatics

Graph-based signal integration for patient diagnosis discovery
http://www.biotconf.org/biot2011/BIOT-2011-EA-Herskovic-Bernstam.pdf

Biomedical information problems are often difficult to solve with computational techniques. One reason for this difficulty is that many qualitatively different information types must be combined to solve the problem. For example, diagnosing a disease requires that information such as patient demographics, past medical history, disease prevalence, and current complaints be integrated in some rational way. In this paper, we describe an approach that leverages graphs to integrate several information sources to "diagnose" a patient record.

A three-level graph-based model for the management of hospital information systems.
http://www.ncbi.nlm.nih.gov/pubmed/7476470
Information processing in hospitals, especially in university hospitals, is currently faced with two major issues: low-cost hardware and progress in networking technology lead to a further decentralization of computing capacity, and, due to the increasing need for information processing in hospitals and economic restrictions, it is necessary to use commercial software products. This leads to heterogeneous hospital information systems using a variety of software and hardware products, and to a stronger demand for integrating these products and, in general, for a dedicated methodology for the management of hospital information systems to support patient care and medical research. We present a three-level graph-based model (3LGM) to support the systematic management of hospital information systems. 3LGM can serve as a basis for assessing the quality of information processing in hospitals. 3LGM distinguishes between a procedural level for describing the information procedures (and their information interchange) of a hospital information system and thus its functionality, a logical tool level, focusing on application systems and communication links, and a physical tool level with physical subsystems (e.g., computer systems) and data transmission. The examples that are presented have been taken from the Heidelberg University Hospital Information System.

Integrated multimedia electronic patient record and graph
http://perso.telecom-paristech.fr/~bloch/papers/CBM-08.pdf
Current electronic patient record (EPR) implementations do not incorporate medical images, nor structural information extracted from them, despite the increasing role of images in diagnosis. This paper presents a framework for integrating into EPRs anatomical and pathological knowledge extracted from segmented magnetic resonance imaging (MRI), applying a graph representation of anatomical and functional information for individual patients. Focusing on cerebral tumor examination and patient follow-up, multimedia EPRs were created and evaluated through a 3D navigation application developed with open-source libraries and standards. Results suggest that the enhanced clinical information scheme could lead to original changes in the way medical experts utilize image-based information.


A Graph-based Bagging
http://www.cse.iitm.ac.in/CoLISD/Papers/4.pdf
The ensemble technique weights several individual classifiers and combines them to obtain a composite classifier that outperforms each of them alone. Although this technique has been successfully applied in many areas, in our literature review we did not find applications to networked data. To address this gap, we propose a bagging procedure applied to graph-based classifiers. Our contribution is a bagging procedure for networked data which either maintains or significantly improves the performance of the tested relational classifiers. Additionally, we investigate the role played by diversity among the individual classifiers to explain when the technique can be successfully applied.
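A minimal sketch of the idea, assuming a toy relational classifier (majority vote over labeled neighbors) rather than the classifiers actually tested in the paper; the bagging step resamples the labeled nodes with replacement:

```python
import random
from collections import Counter

def neighbor_vote(adj, labels, node):
    """Toy relational classifier: predict the majority label among a node's
    labeled neighbors (no labeled neighbors: fall back to global majority)."""
    votes = Counter(labels[n] for n in adj[node] if n in labels)
    if votes:
        return votes.most_common(1)[0][0]
    return Counter(labels.values()).most_common(1)[0][0]

def bagged_predict(adj, labels, node, rounds=25, seed=0):
    """Bagging for networked data: resample the labeled node set with
    replacement, classify with each resampled set, return the majority."""
    rng = random.Random(seed)
    items = list(labels.items())
    preds = []
    for _ in range(rounds):
        sample = dict(rng.choices(items, k=len(items)))  # bootstrap sample
        preds.append(neighbor_vote(adj, sample, node))
    return Counter(preds).most_common(1)[0][0]
```

The diversity the authors study comes from each bootstrap sample exposing a different subset of the labeled neighborhood.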

Graph-based learning for information systems
http://gradworks.umi.com/33/52/3352368.html


Design and Evaluation of a Temporal, Graph-based Language for Querying Collections of Patient Histories

Giving clinicians and researchers the ability to easily retrieve and explore relevant fragments of patient histories would greatly facilitate quality assurance, patient follow-up and research on patient treatment processes. Established database query languages are inconvenient for such exploration, and may also be too complex for users with limited backgrounds in informatics. We believe that understandability can be increased in return for a sacrifice of some of the power of expression found in general query languages. In order to design a specialized query language, we have collected and synthesized a tentative list of requirements. Based on these requirements, we have designed and implemented Practice Explorer, a prototype for visual query of collections of patient histories, and evaluated the understandability of its query language by testing with medical students. The results indicate that parts of the language are intuitive enough for users to understand without demonstrations, examples, feedback or assistance. They also provide some lessons for future work in this area.


Graph-based Word Sense Disambiguation of Biomedical Documents
http://bioinformatics.oxfordjournals.org/content/early/2010/10/07/bioinformatics.btq555.full.pdf
Word Sense Disambiguation (WSD), automatically identifying the meaning of ambiguous words in context, is an important stage of text processing. This paper presents a graph-based approach to WSD in the biomedical domain. The method is unsupervised and does not require any labeled training data. It makes use of knowledge from the Unified Medical Language System (UMLS) Metathesaurus, which is represented as a graph. A state-of-the-art algorithm, Personalized PageRank, is used to perform WSD. Results: when evaluated on the NLM-WSD dataset, the algorithm outperforms other methods that rely on the UMLS Metathesaurus alone.
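Personalized PageRank itself is standard; a minimal power-iteration sketch (plain dictionaries standing in for the UMLS Metathesaurus graph, parameters illustrative) might look like:

```python
def personalized_pagerank(adj, seeds, damping=0.85, iters=50):
    """Power iteration for Personalized PageRank: the random surfer teleports
    back to the seed nodes (e.g. concepts of the context words) instead of
    uniformly to all nodes. Mass at dangling nodes is simply dropped here."""
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        new = {n: (1 - damping) * restart[n] for n in nodes}
        for u in nodes:
            out = adj[u]
            if not out:
                continue
            share = damping * rank[u] / len(out)  # spread rank along out-edges
            for v in out:
                new[v] += share
        rank = new
    return rank
```

For WSD, the candidate sense node with the highest rank under the context's personalization vector would be chosen.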


Graph-Based Shape Analysis for MRI Classification
http://www.igi-global.com/article/international-journal-knowledge-discovery-bioinformatics/62299
Searching for correlations between brain structure and attributes of a person’s intellectual state is a process which may be better done by automation than by human labor. Such an automated system would be capable of performing classification based on the discovered correlation, which would be a means of testing how accurate the discovered correlation is. The authors have developed a system which generates a graph-based representation of the shape of the third and lateral ventricles based on a structural MRI, and classifies images represented in this manner. The system is evaluated on accuracy at classifying individuals showing cognitive impairment due to Alzheimer’s Disease. Classification accuracy is 74.2% when individuals with CDR 0.5 are included as impaired in a balanced dataset of 166 images, and 79.3% when differentiating individuals with CDR at least 1.0 from healthy individuals in a balanced dataset of 54 images. Finally, the system is used to classify MR images according to level of education, with 77.2% accuracy differentiating highly-educated individuals from those for whom no higher education is listed, in a balanced dataset of 178 images.


Graph Based Detection of Optic Disc and Fovea in Retinal Images
http://www.inf.unideb.hu/~hajdua/papers/hajduc_2010h.pdf

Diabetic retinopathy (DR) is damage to the eye's retina that occurs with long-term diabetes and can eventually lead to blindness. Screening programs for DR are being introduced; however, an important prerequisite for automation is the accurate localization of the main anatomical features in the image, notably the optic disc (OD) and the macula. A number of interesting algorithms have been proposed in the recent past, and their performance is generally good, but each method has situations where it fails. This paper presents a novel framework for automatic detection of the optic disc and macula in retinal fundus images using a combination of different optic disc and macula detectors represented by a weighted complete graph. A node pruning procedure removes the worst vertices of the graph by enforcing predefined geometric constraints and selects the best possible detector outputs, which are then combined using a weighted average. Extensive tests have shown that combining the predictions of multiple detectors is more accurate than any of the individual detectors making up the ensemble.

Power Watershed: A Unifying Graph-Based Optimization Framework
http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2010.200

In this work, we extend a common framework for graph-based image segmentation that includes the graph cuts, random walker, and shortest path optimization algorithms. Viewing an image as a weighted graph, these algorithms can be expressed by means of a common energy function with differing choices of a parameter q acting as an exponent on the differences between neighboring nodes. Introducing a new parameter p that fixes a power for the edge weights allows us to also include the optimal spanning forest algorithm for watershed in this same framework. We then propose a new family of segmentation algorithms that fixes p to produce an optimal spanning forest but varies the power q beyond the usual watershed algorithm, which we term the power watershed. In particular, when q=2, the power watershed leads to a multilabel, scale and contrast invariant, unique global optimum obtained in practice in quasi-linear time. Placing the watershed algorithm in this energy minimization framework also opens new possibilities for using unary terms in traditional watershed segmentation and using watershed to optimize more general models of use in applications beyond image segmentation.
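The common energy function referred to above can be written roughly as follows (my reconstruction from the power-watershed literature; treat the exact seed constraints as a sketch):

```latex
% x is the labeling, w_{ij} the weight of edge e_{ij}:
\min_{x}\ \sum_{e_{ij}\in E} w_{ij}^{\,p}\,\lvert x_i - x_j \rvert^{q}
\quad \text{s.t. } x_i = 1 \text{ for foreground seeds},\ x_i = 0 \text{ for background seeds}
```

As I recall the mapping: q = 1 recovers graph cuts, q = 2 the random walker, and q → ∞ shortest paths; taking p → ∞ (an optimal spanning forest on the edge weights) while varying q gives the power watershed family, with q = 2 as the main case discussed above.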


MEDRank: using graph-based concept ranking to index biomedical texts.
http://www.mendeley.com/research/medrank-using-graphbased-concept-ranking-index-biomedical-texts-1/
As the volume of biomedical text increases exponentially, automatic indexing becomes increasingly important. However, existing approaches do not distinguish central (or core) concepts from concepts that were mentioned in passing. We focus on the problem of indexing MEDLINE records, a process that is currently performed by highly trained humans at the National Library of Medicine (NLM). NLM indexers are assisted by a system called the Medical Text Indexer (MTI) that suggests candidate indexing terms.


A semantic graph-based approach to biomedical summarisation
http://dl.acm.org/citation.cfm?id=2016307
Access to the vast body of research literature that is available in biomedicine and related fields may be improved by automatic summarisation. This paper presents a method for summarising biomedical scientific literature that takes into consideration the characteristics of the domain and the type of documents. Methods: To address the problem of identifying salient sentences in biomedical texts, concepts and relations derived from the Unified Medical Language System (UMLS) are arranged to construct a semantic graph that represents the document. A degree-based clustering algorithm is then used to identify different themes or topics within the text. Different heuristics for sentence selection, intended to generate different types of summaries, are tested. A real document case is drawn up to illustrate how the method works. Results: A large-scale evaluation is performed using the recall-oriented understudy for gisting-evaluation (ROUGE) metrics. The results are compared with those achieved by three well-known summarisers (two research prototypes and a commercial application) and two baselines. Our method significantly outperforms all summarisers and baselines. The best of our heuristics achieves an improvement in performance of almost 7.7 percentage units in the ROUGE-1 score over the LexRank summariser (0.7862 versus 0.7302). A qualitative analysis of the summaries also shows that our method succeeds in identifying sentences that cover the main topic of the document and also considers other secondary or ''satellite'' information that might be relevant to the user. Conclusion: The method proposed is proved to be an efficient approach to biomedical literature summarisation, which confirms that the use of concepts rather than terms can be very useful in automatic summarisation, especially when dealing with highly specialised domains.


Discovery of Temporal Relations Using Graph Structures (September 15) (Special Issue on Active Mining, and general papers)
http://ci.nii.ac.jp/naid/110002664364
The most difficult part of knowledge discovery on medical data is handling time-series data that arise inhomogeneously. This paper proposes a rule discovery method based on first-order predicate logic for handling such data. The proposed method represents time-series data as graphs and rewrites the data according to rules, thereby generating candidate predicates for use in the discovered rules. For evaluation, we ran experiments on real medical data and showed the effectiveness of the method.

PuReD-MCL: a graph-based PubMed document clustering methodology
http://bioinformatics.oxfordjournals.org/content/24/17/1935.full


A graph-based representation of Gene Expression profiles in DNA microarrays
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4675762
This paper proposes a new and very flexible data model, called the gene expression graph (GEG), for gene expression analysis and classification. Three features differentiate GEGs from other available microarray data representation structures: (i) the memory occupation of a GEG is independent of the number of samples used to build it; (ii) a GEG more clearly expresses relationships among expressed and non-expressed genes in both healthy- and diseased-tissue experiments; (iii) GEGs make it easy to implement very efficient classifiers. The paper also presents a simple classifier for sample-based classification to show the flexibility and user-friendliness of the proposed data structure.

GOLEM: an interactive graph-based gene-ontology navigation and analysis tool
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1618863/
The Gene Ontology has become an extremely useful tool for the analysis of genomic data and structuring of biological knowledge. Several excellent software tools for navigating the gene ontology have been developed. However, no existing system provides an interactively expandable graph-based view of the gene ontology hierarchy. Furthermore, most existing tools are web-based or require an Internet connection, will not load local annotations files, and provide either analysis or visualization functionality, but not both.



http://www.biomedcentral.com/1471-2105/8/S9/S4/

Graph + Text Analysis

Plagiarism Detection Using Graph-Based Representation
http://www.mendeley.com/research/plagiarism-detection-using-graphbased-representation/

Plagiarism of material from the Internet is a widespread and growing problem. Several methods are used to detect plagiarism and the similarity between a source document and suspect documents, such as fingerprinting based on characters or n-grams. In this paper, we discuss a new method to detect plagiarism based on a graph representation. Preprocessing of each document is required, such as breaking the document down into its constituent sentences, segmenting each sentence into separate terms, and removing stop words. We build the graph by grouping each sentence's terms into one node; the resulting nodes are connected to each other according to the order of sentences within the document, and all nodes in the graph are also connected to a top-level "Topic Signature" node, formed by extracting the concepts of each sentence's terms and grouping them. The main advantage of the proposed method is that the topic signature, the main entry point to the graph, is used as a quick guide to the relevant nodes that should be considered when comparing source documents with a suspect one. We believe the proposed method can achieve good performance in terms of effectiveness and efficiency.



Possibilities of Bipartite Graph Clustering in Text Analysis (Text Similarity and Sentence-Processing Models)

http://ci.nii.ac.jp/naid/110004809727
For applying Markov clustering (MCL), and our own improved variant, recurrent Markov clustering (RMCL), to text analysis, we propose bipartite-graph construction based on keyword/co-occurring-word pairs as an effective data acquisition method. We compare the results of MCL-RMCL bipartite graph clustering against conventional multivariate analysis based on the vector space model and verify its effectiveness.
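The base MCL algorithm (before the RMCL modification) is easy to sketch: it alternates expansion (matrix squaring, i.e. two-step random walks) with inflation (elementwise powering plus renormalization, which sharpens strong transitions). The inflation exponent and attractor read-off below are the textbook choices, not necessarily the authors':

```python
def mcl(adj, inflation=2.0, iters=20):
    """Minimal Markov Cluster (MCL) sketch on a dense adjacency matrix."""
    n = len(adj)
    # add self-loops, then make the matrix column-stochastic
    m = [[float(adj[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]

    def col_normalize(mat):
        for j in range(n):
            s = sum(mat[i][j] for i in range(n))
            for i in range(n):
                mat[i][j] /= s
        return mat

    m = col_normalize(m)
    for _ in range(iters):
        # expansion: m = m @ m (two-step walk probabilities)
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # inflation: elementwise power, then renormalize columns
        m = col_normalize([[m[i][j] ** inflation for j in range(n)]
                           for i in range(n)])
    # read off clusters: group each column under its dominant attractor row
    clusters = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        clusters.setdefault(attractor, []).append(j)
    return sorted(clusters.values())
```

On the classic toy example of two triangles joined by a single bridge edge, this separates the two triangles.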

Graph + CyberSecurity (2)

EigenDiagnostics: Spotting Connection Patterns and Outliers in Large Graphs
http://www.computer.org/portal/web/csdl/doi/10.1109/ICDMW.2010.203

In a large weighted graph, how can we detect suspicious subgraphs, patterns, and outliers? A suspicious pattern could be a near-clique or a set of nodes bridging two or more near-cliques. This would improve intrusion detection in computer networks and network traffic monitoring. Are there other network patterns that need to be detected? We propose EigenDiagnostics, a fast algorithm that spots such patterns. The process creates scatter plots of node properties (such as eigenscores, degree, and weighted degree), then looks for linear-like patterns. Our tool automatically discovers such plots using the Hough transform from machine vision. We apply EigenDiagnostics to a wide variety of synthetic and real data (LBNL computer traffic, movie-actor data from IMDB, patent citations, and more). EigenDiagnostics finds surprising patterns. They appear to correspond to port scanning (in computer networks), repetitive tasks with botnet-like behavior, strange "bridges" in movie-actor data (due to actors changing careers, for example), and more. The advantages are: (a) it is effective in discovering surprising patterns; (b) it is fast (linear in the number of edges); (c) it is parameter-free; and (d) it is general, applicable to many diverse graphs spanning tens of gigabytes.


Mining and Modeling Real Graphs: Patterns, Generators, Anomalies, and Tools
http://www.cs.cmu.edu/~lakoglu/proposal/lakoglu-proposal.pdf




Community-based anomaly detection in evolutionary networks
http://www.springerlink.com/content/b61165511117u863/
Networks of dynamic systems, including social networks, the World Wide Web, climate networks, and biological networks, can be highly clustered. Detecting clusters, or communities, in such dynamic networks is an emerging area of research; however, less work has been done in terms of detecting community-based anomalies. While there has been some previous work on detecting anomalies in graph-based data, none of these anomaly detection approaches have considered an important property of evolutionary networks—their community structure. In this work, we present an approach to uncover community-based anomalies in evolutionary networks characterized by overlapping communities. We develop a parameter-free and scalable algorithm using a proposed representative-based technique to detect all six possible types of community-based anomalies: grown, shrunken, merged, split, born, and vanished communities. We detail the underlying theory required to guarantee the correctness of the algorithm. We measure the performance of the community-based anomaly detection algorithm by comparison to a non–representative-based algorithm on synthetic networks, and our experiments on synthetic datasets show that our algorithm achieves a runtime speedup of 11–46 over the baseline algorithm. We have also applied our algorithm to two real-world evolutionary networks, Food Web and Enron Email. Significant and informative community-based anomaly dynamics have been detected in both cases.



Using Bayesian Networks for Cyber Security Analysis
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5544924
Capturing the uncertain aspects in cyber security is important for security analysis in enterprise networks. However, there has been insufficient effort in studying what modeling approaches correctly capture such uncertainty, and how to construct the models to make them useful in practice. In this paper, we present our work on justifying uncertainty modeling for cyber security, and initial evidence indicating that it is a useful approach. Our work is centered around near real-time security analysis such as intrusion response. We need to know what is really happening, the scope and severity level, possible consequences, and potential countermeasures. We report our current efforts on identifying the important types of uncertainty and on using Bayesian networks to capture them for enhanced security analysis. We build an example Bayesian network based on a current security graph model, justify our modeling approach through attack semantics and experimental study, and show that the resulting Bayesian network is not sensitive to parameter perturbation.


Applying Graph-Based Anomaly Detection Approaches to the Discovery of Insider Threats
http://eecs.wsu.edu/~holder/pubs/EberleISI09.pdf

The ability to mine data represented as a graph has become important in several domains for detecting various structural patterns. One important area of data mining is anomaly detection, but little work has been done in terms of detecting anomalies in graph-based data. In this paper we present graph-based approaches to uncovering anomalies in applications containing information representing possible insider threat activity: e-mail, cell-phone calls, and order processing.




Graph-based malware detection using dynamic analysis
http://www.mendeley.com/research/graphbased-malware-detection-using-dynamic-analysis/

Visualizing graph dynamics and similarity for enterprise network security and management
Managing complex enterprise networks requires an understanding at a finer granularity than traditional network monitoring. The ability to correlate and visualize the dynamics and inter-relationships among various network components such as hosts, users, and applications is non-trivial. In this paper, we propose a visualization approach based on the hierarchical structure of similarity/difference visualization in the context of heterogeneous graphs. The concept of hierarchical visualization starts with the evolution of inter-graph states, adapts to the visualization of intra-graph clustering, and concludes with the visualization of similarity between individual nodes. Our visualization tool, ENAVis (Enterprise Network Activities Visualization), quantifies and presents these important changes and dynamics essential to network operators through a visually appealing and highly interactive manner. Through novel graph construction and transformation, such as network connectivity graphs, MDS graphs, bipartite graphs, and similarity graphs, we demonstrate how similarity/dynamics can be effectively visualized to provide insight with regards to network understanding.



A Graph Similarity-based Approach to Security Event Analysis Using Correlation Techniques

http://dl.acm.org/citation.cfm?id=1850799
Detecting and identifying security events to provide cyber situation awareness has become an increasingly important task within the network research and development community. We propose a graph similarity-based approach to event detection and identification that integrates a number of techniques to collect time-varying situation information, extract correlations between event attributes, and characterize and identify security events. Diverging from traditional rule- or statistics-based pattern matching techniques, the proposed mechanism represents security events in a graphical form of correlation networks and identifies security events through the computation of graph similarity measurements, eliminating the need to construct user or system profiles. These technical components take fundamentally different approaches from traditional empirical or statistical methods and are designed based on rigorous computational analysis with mathematically proven performance guarantees. The performance superiority of the proposed mechanism is demonstrated by extensive simulation and experimental results.

Graph Based Statistical Analysis of Network Traffic
http://www.cs.purdue.edu/mlg2011/papers/paper_10.pdf

We propose a method for analyzing traffic data in large computer networks such as big enterprise networks or the Internet. Our approach combines a graph-theoretical representation of the data and graph analysis with novel statistical methods for discovering patterns and time-related anomalies. We model the traffic as a graph and use temporal characteristics of the data to decompose it into subgraphs corresponding to individual sessions, whose characteristics are then analyzed using statistical methods. The goal of the analysis is to discover patterns in the network traffic data that might indicate intrusion activity or other malicious behavior.


Using Graph Theory to Detect Security Policy Violators
http://www.mawode.com/~waltman/talks/plug05emailgraph.pdf

Understanding Multistage Attacks by Attack-Track based Visualization of Heterogeneous Event Streams
http://www.cse.buffalo.edu/~shambhu/documents/pdf/vizsec01s-mathew.pdf

In this paper, we present a method of handling the visualization of heterogeneous event traffic that is generated by intrusion detection sensors, log files, and other event sources on a computer network, from the point of view of detecting multistage attack paths that are of importance. We perform aggregation and correlation of these events based on their semantic content to generate Attack Tracks that are displayed to the analyst in real time. Our tool, called the Event Correlation for Cyber-Attack Recognition System (ECCARS), enables the analyst to distinguish and separate an evolving multistage attack from the thousands of events generated on a network. We focus here on presenting the environment and framework for multistage attack detection using ECCARS, along with screenshots that demonstrate its capabilities.

Scenario Graphs Applied to Network Security
http://www.cs.cmu.edu/~scenariograph/wing07.pdf

Traditional model checking produces one counterexample to illustrate a violation of a property by a model of the system. Some applications benefit from having all counterexamples, not just one. We call this set of counterexamples a scenario graph. In this chapter we present two different algorithms for producing scenario graphs and explain how scenario graphs are a natural representation for attack graphs used in the security community. Through a detailed concrete example, we show how we can model a computer network and generate and analyze attack graphs automatically. The attack graph we produce for a network model shows all ways in which an intruder can violate a given desired security property.

Network Security Evaluation through Attack Graph Generation
http://www.waset.org/journals/waset/v54/v54-73.pdf
In today’s network, security evaluation is a challenging
task for most of the administrators. The typical means by which an
attacker breaks into a network is through a series of exploits, where
each exploit in the series satisfies the precondition for subsequent
exploits and makes a causal relationship among them. Such a series of
exploits constitutes an attack path and the set of all possible attack
paths form an attack graph. Even the well administered networks are
susceptible to such attacks as present day vulnerability scanners are
only able to identify the vulnerabilities in isolation but there is a need
for logical formalism and correlation among these vulnerabilities
within a host or across multiple hosts to identify overall risk of the
network. In this paper we propose a network security analysis method
based on the generation of a network attack graph. After analyzing network
vulnerabilities, the linking relations between devices, and the characteristics
of attacks, a model of network security states is built, and the
attack-graph generation algorithm is implemented. Attack graphs
are important tools for analyzing security vulnerabilities in enterprise
networks. The experiment validates the evaluation method we
proposed.
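The exploit-chaining idea in the abstract can be sketched directly: each exploit's preconditions must already hold, and firing it grants new postconditions that enable further exploits. The exploit names and pre/postconditions below are invented for illustration, not taken from the paper.

```python
# Hypothetical three-exploit model: each exploit maps a set of
# preconditions (facts the attacker holds) to granted postconditions.
EXPLOITS = {
    "ftp_rhosts": ({"ftp_access"}, {"trust_relation"}),
    "rsh_login":  ({"trust_relation"}, {"user_shell"}),
    "local_bof":  ({"user_shell"}, {"root_shell"}),
}

def build_attack_graph(initial_facts):
    """Forward-chain exploits whose preconditions hold, then record a
    causal edge e1 -> e2 whenever e1's postconditions satisfy part of
    e2's preconditions."""
    facts = set(initial_facts)
    fired = set()
    changed = True
    while changed:
        changed = False
        for name, (pre, post) in EXPLOITS.items():
            if name not in fired and pre <= facts:
                fired.add(name)
                facts |= post
                changed = True
    edges = [(e1, e2)
             for e1 in fired for e2 in fired
             if e1 != e2 and EXPLOITS[e1][1] & EXPLOITS[e2][0]]
    return fired, edges, facts

fired, edges, facts = build_attack_graph({"ftp_access"})
print(edges)  # causal chain: ftp_rhosts -> rsh_login -> local_bof
```

The returned edges form the attack graph; any path ending in a fact that violates the security property (here, "root_shell") is an attack path.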

Graph + CyberSecurity

Making Cyber Security Decisions Through a Quantitative Metrics Approach
http://eden.dei.uc.pt/~mvieira/raci2011/Sanders_RACI_keynote.pdf


Least Effort Strategies for Cybersecurity
http://arxiv.org/pdf/cond-mat/0306002

Cybersecurity is an issue of increasing concern since the events of September 11.
Many questions have been raised concerning the security of the Internet and the rest of the US's
information infrastructure. This paper begins to examine the issue by analyzing the Internet’s
autonomous system (AS) map. Using the AS map, malicious infections are simulated and
different defense strategies are considered in a cost benefit framework. The results show that
protecting the most connected nodes provides significant gains in security and that after the small
minority of the most connected nodes are protected there are diminishing returns for further
protection. However, if parts of the small minority of the most connected firms
are not protected, such as non-US firms, protection levels are significantly decreased.
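The paper's cost-benefit finding can be reproduced on a toy topology: once the highly connected nodes are immunized, further protection buys little. The graph and numbers below are illustrative, not the AS map the authors used.

```python
from collections import deque

def reachable(adj, seed, protected):
    """Nodes an infection seeded at `seed` can reach when `protected`
    nodes refuse to relay it."""
    if seed in protected:
        return set()
    seen, q = {seed}, deque([seed])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in seen and v not in protected:
                seen.add(v)
                q.append(v)
    return seen

# toy hub-dominated topology: node 0 is the single highly connected hub
n = 100
adj = {i: set() for i in range(n)}
for i in range(1, n):
    adj[0].add(i)
    adj[i].add(0)

by_degree = sorted(adj, key=lambda u: len(adj[u]), reverse=True)
spread = {k: len(reachable(adj, 50, set(by_degree[:k]))) for k in (0, 1, 5)}
print(spread)  # protecting the hub collapses the epidemic; protecting
               # more nodes beyond it gains nothing on this topology
```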

CyberSecurity + Graph

Cyber Security Risks Assessment with Bayesian Defense Graphs and Architectural Models
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4755419

To facilitate rational decision making regarding cyber security investments, decision makers need to be able to assess expected losses before and after potential investments. This paper presents a model-based assessment framework for analyzing the cyber security provided by different architectural scenarios. The framework uses Bayesian-statistics-based extended influence diagrams to express attack graphs and related countermeasures. In this paper it is demonstrated how this structure can be captured in an abstract model to support analysis based on architectural models. The approach allows calculating the probability that attacks will succeed and the expected loss of these attacks given the instantiated architectural scenario. Moreover, the framework can handle the uncertainties that accompany the analyses. In architectural analysis there are uncertainties associated both with the scenario and its properties, and with the analysis framework that stipulates how security countermeasures contribute to cyber security.
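The core quantity such a framework computes is scenario-level expected loss. A minimal sketch with made-up numbers (the probabilities and loss figure are illustrative, not from the paper):

```python
def expected_loss(p_attempt, p_success, loss):
    """Expected loss of one attack scenario: probability the attack is
    attempted, times probability it succeeds given the architecture,
    times the loss if it does."""
    return p_attempt * p_success * loss

# a countermeasure that halves the conditional success probability
# halves the expected loss of this toy scenario
before = expected_loss(0.3, 0.6, 1_000_000)
after = expected_loss(0.3, 0.3, 1_000_000)
print(before, after)
```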

Centrifuge 2.0 - Cyber Security Analysis - Identify Insights with Relationship Graphs
http://www.youtube.com/watch?v=8kJk0mBh8sg
http://www.youtube.com/watch?v=eZqxRGBwIO0&feature=related


Attack Graphs for Proactive Digital Forensics



A Mathematical Basis for Science-Based Cybersecurity

Mining Graph Patterns Efficiently via Randomized Summaries
http://www.cs.uiuc.edu/~hanj/pdf/vldb09_cchen.pdf

Graphs are prevalent in many domains such as Bioinformatics, social networks, Web and cyber-security. Graph pattern
mining has become an important tool in the management
and analysis of complexly structured data, where example
applications include indexing, clustering and classification.
Existing graph mining algorithms have achieved great success by exploiting various properties in the pattern space.
Unfortunately, due to the fundamental role subgraph isomorphism plays in these methods, they may all enter into a
pitfall when the cost to enumerate a huge set of isomorphic
embeddings blows up, especially in large graphs.
The solution we propose for this problem resorts to reduction on the data space. For each graph, we build a summary of it and mine this shrunk graph instead. Compared
to other data reduction techniques that either reduce the
number of transactions or compress between transactions,
this new framework, called Summarize-Mine, suggests a
third path by compressing within transactions. Summarize-Mine is effective in cutting down the size of graphs, thus
decreasing the embedding enumeration cost. However, compression might lose patterns at the same time. We address
this issue by generating randomized summaries and repeating the process for multiple rounds, where the main idea is
that true patterns are unlikely to be missed in all rounds. We
provide strict probabilistic guarantees on pattern loss likelihood. Experiments on real malware trace data show that
Summarize-Mine is very efficient and can find interesting malware fingerprints that were not revealed previously.
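The multi-round guarantee is easy to quantify: if one randomized summary preserves a given true pattern with some per-round probability, independent rounds shrink the overall miss probability geometrically. A sketch (the 0.5 preservation probability is an assumption for illustration, not a figure from the paper):

```python
import math

def rounds_needed(per_round_miss, target):
    """Smallest r such that the chance a true pattern is missed in
    every round, per_round_miss ** r, is at most `target`."""
    return math.ceil(math.log(target) / math.log(per_round_miss))

# if one randomized summary keeps a given pattern with probability 0.5,
# seven independent rounds push the overall miss probability below 1%
r = rounds_needed(0.5, 0.01)
print(r)
```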


Cyber Security Link Analysis Graph
http://www.analyticbridge.com/photo/cyber-security-link-analysis
This visualization shows interesting patterns of behavior for recent network login traffic. The linkages are between source and destination IPs. The circular stars show one-to-one relationships representing normal behavior. But the unusual pattern in the lower center shows a destination IP under attack: it has over 100 source IPs sending it traffic.


Oil Companies Stepping up Cyber Security as Hacking Attacks Increase
http://oilprice.com/Energy/Energy-General/Oil-Companies-Stepping-up-Cyber-Security-as-Hacking-Attacks-Increase.html


An Attack Graph Based Approach for Threat Identification of an Enterprise Network
http://www.igi-global.com/chapter/cyber-security-global-information-assurance/7409

The science of cyber security experimentation: the DETER project
http://dl.acm.org/citation.cfm?id=2076752
Since 2004, the DETER Cyber-security Project has worked to create an evolving infrastructure - facilities, tools, and processes - to provide a national resource for experimentation in cyber security. Building on our insights into requirements for cyber science and on lessons learned through 8 years of operation, we have made several transformative advances towards creating the next generation of DeterLab. These advances in experiment design and research methodology are yielding progressive improvements not only in experiment scale, complexity, diversity, and repeatability, but also in the ability of researchers to leverage prior experimental efforts of other researchers in the DeterLab user community. This paper describes the advances resulting in a new experimentation science and a transformed facility for cyber-security research, development, and evaluation.


Blog Data Mining for Cyber Security Threats
http://ewinarko.staff.ugm.ac.id/datamining/tugas2/09-app-blogmining-cybersec.pdf
Blog data mining is a growing research area that addresses the domain-specific problem of extracting information from blog data. In our work, we analyzed
blogs for various categories of cyber threats related to the detection of security
threats and cyber crime. We have extended the Author-Topic model based on Latent Dirichlet Allocation to identify patterns of similarities in keywords and dates
distributed across blog documents. From this model, we visualized the content and
date similarities using the Isomap dimensionality reduction technique. Our findings
support the theory that our probabilistic blog model can present the blogosphere in
terms of topics with measurable keywords, hence aiding the investigative processes
to understand and respond to critical cyber security events and threats.


Insider Threat Detection Using Graph-Based Approach
http://eecs.wsu.edu/~holder/pubs/EberleCATCH09.pdf

Protecting our nation’s cyber infrastructure and
securing sensitive information are critical challenges
for homeland security and require the research,
development and deployment of new technologies
that can be transitioned into the field for combating
cyber security risks. Particular areas of concern are
the deliberate and intended actions associated with
malicious exploitation, theft or destruction of data, or
the compromise of networks, communications or
other IT resources, of which the most harmful and
difficult to detect threats are those propagated by an
insider. However, current efforts to identify
unauthorized access to information, such as what is
found in document control and management systems,
are limited in scope and capabilities.
In order to address this issue, this effort involves
performing further research and development on the
existing Graph-Based Anomaly Detection (GBAD)
system [3]. GBAD discovers anomalous instances of
structural patterns in data that represent entities,
relationships and actions. Input to GBAD is a labeled
graph in which entities are represented by labeled
vertices and relationships or actions are represented
by labeled edges between entities. Using the
minimum description length (MDL) principle to
identify the normative pattern that minimizes the
number of bits needed to describe the input graph
after being compressed by the pattern, GBAD
implements algorithms for identifying the three
possible changes to a graph: modifications,
insertions and deletions. Each algorithm discovers
those substructures that match the closest to the
normative pattern without matching exactly. As a
result, GBAD is looking for those activities that
appear to match normal (or legitimate) transactions,
but in fact are structurally different.
As a solution to the problem of insider threat
detection, we will apply GBAD to datasets that
represent the flow of information between entities, as
well as the actions that take place on the information.
This research involves the representation of datasets,
like a document control and management system, as a
graph, enhancement of GBAD’s performance levels,
and evaluation of GBAD on these datasets. In
previous research, GBAD has already achieved over
95% accuracy detecting anomalies in simulated
domains, with minimal false positives, on graphs of
up to 100,000 vertices.
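A toy rendering of the MDL idea GBAD builds on: the normative pattern is the one that compresses the graph best, and whatever survives compression is an anomaly candidate. Real GBAD matches labeled subgraph structures; this sketch, for illustration only, collapses edge-label multisets.

```python
from collections import Counter

def mdl_score(edge_labels, pattern):
    """Toy MDL cost: pattern size plus graph size after each occurrence
    of the pattern (treated as a multiset of edge labels) collapses to
    one placeholder symbol; smaller is better."""
    g = Counter(edge_labels)
    p = Counter(pattern)
    occurrences = min(g[l] // p[l] for l in p)
    compressed = sum(g.values()) - occurrences * (sum(p.values()) - 1)
    return len(pattern) + compressed, occurrences

# ten normal document workflows plus one structurally deviant action
edges = ["writes", "reviews", "approves"] * 10 + ["deletes"]
score, occ = mdl_score(edges, ["writes", "reviews", "approves"])
print(score, occ)  # the lone "deletes" edge survives compression,
                   # so it stands out as an anomaly candidate
```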

Information credibility on Twitter

Information Credibility on Twitter
http://www.www2011india.com/proceeding/proceedings/p675.pdf


We analyze the information credibility of news propagated
through Twitter, a popular microblogging service. Previous
research has shown that most of the messages posted on
Twitter are truthful, but the service is also used to spread
misinformation and false rumors, often unintentionally.
In this paper we focus on automatic methods for assessing the credibility of a given set of tweets. Specifically, we
analyze microblog postings related to “trending” topics, and
classify them as credible or not credible, based on features
extracted from them. We use features from users’ posting
and re-posting (“re-tweeting”) behavior, from the text of the
posts, and from citations to external sources.
We evaluate our methods using a significant number of
human assessments about the credibility of items on a recent
sample of Twitter postings. Our results show that there are
measurable differences in the way messages propagate, that
can be used to classify them automatically as credible or
not credible, with precision and recall in the range of 70%
to 80%.
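A toy sketch of the feature-plus-classifier pipeline the paper describes. The feature names, weights, and threshold here are illustrative assumptions, not the paper's trained model.

```python
def credibility_features(tweets):
    """Toy stand-ins for the paper's feature families: text features,
    propagation (re-tweet) features, and citations to external sources."""
    n = len(tweets)
    return {
        "frac_with_url": sum("http" in t["text"] for t in tweets) / n,
        "avg_retweets": sum(t["retweets"] for t in tweets) / n,
        "frac_exclaim": sum("!" in t["text"] for t in tweets) / n,
    }

def classify(f):
    # illustrative hand-set weights, not a trained classifier
    score = (2.0 * f["frac_with_url"]
             + 0.1 * f["avg_retweets"]
             - 1.5 * f["frac_exclaim"])
    return "credible" if score > 0.5 else "not credible"

sourced = [{"text": "report at http://example.org", "retweets": 5},
           {"text": "details http://example.org/2", "retweets": 3}]
shouty = [{"text": "OMG you won't believe this!!!", "retweets": 0}]
print(classify(credibility_features(sourced)))
print(classify(credibility_features(shouty)))
```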

X-RIME: Hadoop based large scale social network analysis

http://xrime.sourceforge.net/


Today's Internet-based social network sites possess huge user communities. They hold large amounts of data about their users and want to generate core competency from that data. A key enabler for this is a cost-efficient solution for social data management and social network analysis (SNA). Such a solution faces a few challenges. The most important one is that it should be able to handle massive and heterogeneous data sets. Facing this challenge, traditional data-warehouse-based solutions are usually not cost efficient enough. On the other hand, existing SNA tools are mostly used in single-workstation mode and are not scalable enough. Low-cost, highly scalable data management and processing technologies from the cloud computing community should therefore be brought in to help. However, most existing cloud-based data analysis solutions try to provide SQL-like general-purpose query languages and do not directly support social network analysis, which makes them hard to optimize and hard to use for SNA users.

To fill this gap, we came up with X-RIME. Briefly speaking, X-RIME provides a few value-added layers on top of existing cloud infrastructure to support smart decision loops based on massive data sets and SNA. To end users, X-RIME is a library of Map-Reduce programs used for raw data pre-processing, transformation, calculation of SNA metrics and structures, and graph/network visualization. The library can be integrated with other Hadoop-based data warehouses (e.g., HIVE) to build more comprehensive solutions.

This project is a joint effort of Beijing University of Posts and Telecommunications (BUPT) and IBM China Research Lab, supported by the IBM Open Collaboration Research program. The code is open source under Apache License V2.0.
Currently Supported SNA Metrics and Structures

vertex degree statistics
weakly connected components (WCC)
strongly connected components (SCC)
bi-connected components (BCC)
ego-centric network density
breadth first search / single source shortest path (BFS/SSSP)
K-core
maximal cliques
pagerank
hyperlink-induced topic search (HITS)
minimal spanning tree (MST)
grid variant of Fruchterman-Reingold network layout algorithm
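Several of the metrics above (BFS/SSSP, WCC, PageRank) are naturally expressed as chains of map/reduce rounds. A sketch of BFS in that round-based shape, in plain Python rather than as actual Hadoop jobs:

```python
def bfs_rounds(adj, source):
    """BFS as iterated map/reduce rounds: each round, the map phase
    emits candidate distances for the frontier's neighbors, and the
    reduce phase keeps the minimum per vertex."""
    dist = {source: 0}
    frontier = {source}
    while frontier:
        # map: emit (neighbor, candidate distance) pairs
        emitted = [(v, dist[u] + 1) for u in frontier for v in adj[u]]
        # reduce: keep only improvements; improved vertices form the
        # next round's frontier
        frontier = set()
        for v, d in emitted:
            if v not in dist or d < dist[v]:
                dist[v] = d
                frontier.add(v)
    return dist

adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
dist = bfs_rounds(adj, 0)
print(dist)
```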

ZAME: Interactive Large-Scale Graph Visualization

http://www.aviz.fr/~fekete/ps/zame.pdf


We present the Zoomable Adjacency Matrix Explorer (ZAME), a
visualization tool for exploring graphs at a scale of millions of
nodes and edges. ZAME is based on an adjacency matrix graph
representation aggregated at multiple scales. It allows analysts to
explore a graph at many levels, zooming and panning with interactive performance from an overview to the most detailed views.
Several components work together in the ZAME tool to make this
possible. Efficient matrix ordering algorithms group related elements. Individual data cases are aggregated into higher-order meta-representations. Aggregates are arranged into a pyramid hierarchy that allows for on-demand paging to GPU shader programs
to support smooth multiscale browsing. Using ZAME, we are
able to explore the entire French Wikipedia—over 500,000 articles and 6,000,000 links—with interactive performance on standard
consumer-level computer hardware.
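The aggregation pyramid can be sketched in a few lines: each level merges 2x2 blocks of the level below into edge counts. ZAME then pages such aggregates to GPU shaders for rendering; this sketch only computes them.

```python
def aggregate(matrix):
    """One level of the aggregation pyramid: each 2x2 block of the
    adjacency matrix becomes a single cell holding its edge count."""
    half = len(matrix) // 2
    return [[sum(matrix[2 * i + di][2 * j + dj]
                 for di in (0, 1) for dj in (0, 1))
             for j in range(half)]
            for i in range(half)]

level0 = [[0, 1, 0, 0],
          [1, 0, 1, 0],
          [0, 1, 0, 1],
          [0, 0, 1, 0]]
level1 = aggregate(level0)  # 2x2 overview of the 4x4 adjacency matrix
print(level1)
```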

Gephi

http://oss.infoscience.co.jp/gephi/gephi.org/plugins/graph-streaming/index.html
http://gephi.org/blog/

The purpose of the Graph Streaming API is to build a unified framework for streaming graph objects. Gephi’s data structure and visualization engine has been built with the idea that a graph is not static and might change continuously. By connecting Gephi with external data-sources, we leverage its power to visualize and monitor complex systems or enterprise data in real-time. Moreover, the idea of streaming graph data goes beyond Gephi, and a unified and standardized API could bring interoperability with other available tools for graph and network analysis, as they could start to interoperate with other tools in a distributed and cooperative fashion.

Data Set is available at
http://oss.infoscience.co.jp/gephi/wiki.gephi.org/index.php/Datasets.html

Saturday, December 17, 2011

Speeding up bipartite-graph relevance analysis, and system profiling

- Acceleration with GPGPU
- System profiling: where is the time spent, and how much?
- Make the underlying platform independent of System S so it can run at large scale on TSUBAME

Wednesday, December 14, 2011

Random Walks on Directed and Undirected Graphs

http://www.math.ucsd.edu/~phorn/math261/

Fast Top-k Search for Random Walk with Restart

db-event.jpn.org/deim2011/proceedings/pdf/d3-1.pdf
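For reference, the baseline such top-k work accelerates is plain power iteration for Random Walk with Restart: r <- (1 - c) * W^T r + c * e, with e the indicator of the restart node. A sketch (the restart probability c = 0.15 is a conventional choice, not necessarily the paper's):

```python
def rwr_scores(adj, restart, c=0.15, iters=100):
    """Baseline power iteration for RWR; dangling mass also returns to
    the restart node so the scores stay a probability distribution."""
    r = {u: 0.0 for u in adj}
    r[restart] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        dangling = 0.0
        for u, out in adj.items():
            if out:
                share = (1 - c) * r[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                dangling += (1 - c) * r[u]
        nxt[restart] += c + dangling
        r = nxt
    return r

adj = {0: [1, 2], 1: [0], 2: [0]}
scores = rwr_scores(adj, 0)
topk = sorted(scores, key=scores.get, reverse=True)[:2]
print(topk)
```

A fast top-k method avoids running this full iteration over the whole graph for every query.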

Tuesday, December 13, 2011

An In-Depth Study of Stochastic Kronecker Graphs

A message from Prof. Murata, who is attending ICDM.

"An In-Depth Study of Stochastic Kronecker Graphs"
C. Seshadhri, Ali Pinar, and Tamara G. Kolda,
Proceedings of the 11th IEEE International Conference on
Data Mining, pp.587-596, 2011.
http://arxiv.org/abs/1102.5046

As the title says, this is a study of the (Graph500-style) Kronecker graph. I have only heard the talk and have not yet read the paper, but:
- The degree distribution is neither heavy-tailed nor power-law; as the figures in the paper show, it is a curve that bounces repeatedly
- Adding noise (the noisy SKG) brings the degree distribution closer to a straight line
- There are many isolated vertices: at k = 42, 74% of the vertices are isolated
- The core size is smaller than that of real networks
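For context, a stochastic Kronecker graph is easy to sample naively: the edge probability for a pair (u, v) is a product of initiator entries selected by their bits. The initiator below is illustrative; Graph500 instead samples a fixed edge count from a normalized initiator rather than running a Bernoulli trial per pair, so this is only a sketch.

```python
import random

def edge_prob(u, v, T, k):
    """Edge probability in a stochastic Kronecker graph on 2**k nodes:
    the product over bit positions of the 2x2 initiator entry selected
    by the corresponding bits of u and v."""
    p = 1.0
    for i in range(k):
        p *= T[(u >> i) & 1][(v >> i) & 1]
    return p

def sample_skg(T, k, rng):
    """Naive O(4**k) sampling: one Bernoulli trial per vertex pair."""
    n = 1 << k
    return [(u, v) for u in range(n) for v in range(n)
            if rng.random() < edge_prob(u, v, T, k)]

T = [[0.9, 0.5],
     [0.5, 0.1]]  # illustrative initiator; entries sum to 2, so ~2**k edges
edges = sample_skg(T, 6, random.Random(0))
isolated = (1 << 6) - len({x for e in edges for x in e})
print(len(edges), isolated)  # vertices touching no sampled edge
```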

MLG 2011 Workshop

http://www.cs.purdue.edu/mlg2011/speakers.html

RAMGRAPH: large scale graph processing in RAMCloud

http://code.google.com/p/ramgraph/

Recently, the pay-as-you-go computing paradigm in public clouds such as Amazon EC2 allows users to execute their computational tasks on rented, virtualized computation resources. Due to the lower cost than hosting a private cloud, this paradigm has become popular for medium- and small-sized Web companies such as Alexa and SmugMug. Mining and processing large web and social networks is one of the common regular tasks for those Web companies. For example, a search engine typically uses a graph-based ranking scheme such as PageRank to give an order to the pages. Moreover, those mining and processing tasks are highly customized, with various user-defined logics applied on the entire graph. The requirement of high efficiency in large-scale graph processing challenges existing tools. This is because most current tools, such as MapReduce and Hadoop, are disk-based, and the hard disk is typically 100-1000x slower than main memory. The random-access nature of graph processing further hampers the performance of disk-based graph processing. The low efficiency of current disk-based systems limits the popularity of graph mining and business intelligence in Web companies, and can result in low utilization of existing investment and the loss of potential business opportunities.

To unleash the computation power of current cloud offerings, we propose Thunder, a large graph processing system running in the main memories of hundreds or thousands of machines. Thunder stores and processes graphs entirely in main memory, and uses hard disks only for backup and archival. The system provides APIs similar to the Map and Reduce functions in MapReduce for users to implement their user-defined logics. These logics are automatically executed on the graph in a distributed manner.

The goal of designing and implementing Thunder is to exploit the advantages of in-memory processing while retaining all the merits of conventional disk-based tools, namely excellent scalability, good fault tolerance, and the ability to express arbitrary and complex customized logic. In particular, we are facing challenging issues such as scalability, availability, complex memory management, and network traffic reduction.

Polonium: Tera-Scale Graph Mining and Inference for Malware Detection

http://siam.omnibooksonline.com/2011datamining/data/papers/052.pdf

We present Polonium, a novel Symantec technology that detects malware through large-scale graph inference. Based on the scalable Belief Propagation algorithm, Polonium infers every file's reputation, flagging files with low reputation as malware. We evaluated Polonium with a billion-node graph constructed from the largest file submissions dataset ever published (60 terabytes). Polonium attained a high true positive rate of 87% in detecting malware; in the field, Polonium lifted the detection rate of existing methods by 10 absolute percentage points. We detail Polonium's design and implementation features instrumental to its success. Polonium has served 120 million people and helped answer more than one trillion queries for file reputation.
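A heavily simplified stand-in for the idea: real Polonium runs Belief Propagation over a billion-node machine-file graph, while this sketch just iterates averaged reputations on a toy bipartite graph (the damping factor d is an assumption of the sketch).

```python
def propagate(file_machines, machine_prior, iters=20, d=0.85):
    """A file's reputation is pulled toward the mean reputation of the
    machines reporting it, and machine reputations are pulled toward
    the files they host, iterated to a fixed point."""
    file_rep = {f: 0.5 for f in file_machines}
    rep = dict(machine_prior)
    owners = {m: [f for f, ms in file_machines.items() if m in ms]
              for m in machine_prior}
    for _ in range(iters):
        for f, ms in file_machines.items():
            file_rep[f] = ((1 - d) * file_rep[f]
                           + d * sum(rep[m] for m in ms) / len(ms))
        for m, fs in owners.items():
            rep[m] = ((1 - d) * machine_prior[m]
                      + d * sum(file_rep[f] for f in fs) / len(fs))
    return file_rep

files = {"good.exe": ["m1", "m2"], "bad.exe": ["m3"]}
reps = propagate(files, {"m1": 0.9, "m2": 0.8, "m3": 0.1})
print(reps)  # low-reputation files are the malware candidates
```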

Router: A Message Passing Model for Large-Scale Graph Mining

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6079403&tag=1


Many parallel computational models have been employed to process large-scale graphs. In this paper, we propose Router, a message passing model that can be invoked by most current parallel computational models to process large graphs. The model is well suited to the multi-source traversal problem, which often occurs in complex graph algorithms. Because the model can traverse the graph from different sources at the same time, a multi-source traversal finishes in far fewer iterations than before, so the total time of an algorithm involving multi-source traversal is reduced substantially. Besides, the Router model is flexible enough to express a broad set of algorithms by implementing Router's abstract methods. Finally, experiments show the efficiency and scalability of the model.
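The multi-source traversal idea can be sketched as message routing: one shared traversal carries (vertex, source) messages for all sources at once, instead of running one full BFS per source. This is only an illustration of the concept, not Router's actual API.

```python
from collections import deque

def multi_source_bfs(adj, sources):
    """Compute distances from every source in a single traversal by
    routing (vertex, source) messages; each vertex keeps one distance
    per source it has heard from."""
    dist = {s: {s: 0} for s in sources}   # vertex -> {source: distance}
    q = deque((s, s) for s in sources)    # pending (vertex, source) messages
    while q:
        u, src = q.popleft()
        for v in adj[u]:
            per_source = dist.setdefault(v, {})
            if src not in per_source:
                per_source[src] = dist[u][src] + 1
                q.append((v, src))
    return dist

adj = {0: [1], 1: [2], 2: []}
dist = multi_source_bfs(adj, [0, 1])
print(dist[2])  # distances to vertex 2 from both sources at once
```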