Tuesday, January 31, 2012

X10 related

- A tool or shell script for incremental builds
- Source code debugging
- A bug that appears when the Optimization option is enabled

Saturday, January 21, 2012

Graph research activities in Japan

Prof. Masatoshi Yoshikawa's group at Kyoto University
http://kyouindb.iimc.kyoto-u.ac.jp/e/wA3vS

Yasushi Sakurai, Lei Li, Yasuko Matsubara, Christos Faloutsos, “WindMine: Fast and Effective Mining of Web-click Sequences”, 11th SIAM International Conference on Data Mining (SDM), Mesa, Arizona, April 28-30, 2011.
A Bipartite Graph Model and Mutually Reinforcing Analysis for Review Sites
Towards Improving Wikipedia as an Image-rich Encyclopedia through Analyzing Appropriateness of Images for an Article
Enishi: Searching Knowledge about Relations by Complementarily Utilizing Wikipedia and the Web
Mining and Explaining Relationships in Wikipedia

Graph Database

S. Sakr (2009) “Storing and Querying Graph Data Using Efficient Relational Processing Techniques”. Proceedings of the 8th International Conference on Information Systems Technology and its Applications (ISTA/UNISCON 2009), Sydney, Australia

S. Sakr (2009) “GraphREL: A Decomposition-Based and Selectivity-Aware Relational Framework for Processing Sub-graph Queries”. Proceedings of the 14th International Conference on Database Systems for Advanced Applications (DASFAA 2009), Brisbane, Australia

S. Sakr, G. Al-Naymat (2010) “Graph Indexing and Querying: A Review”. International Journal of Web Information Systems (IJWIS) 6(2): 101-120.

S. Sakr, G. Al-Naymat (2010) “An Efficient Features-Based Processing Technique for Supergraph Queries”. Proceedings of the 14th International Database Engineering and Applications Symposium (IDEAS 2010), Montreal, QC, Canada.

S. Sakr, G. Al-Naymat (2010) “Efficient Relational Techniques for Processing Graph Queries”. Journal of Computer Science and Technology (JCST) 25(6): 1237-1255.

Taihei Oshino, Yasuhito Asano, and Masatoshi Yoshikawa
Mining Useful Time Graph Patterns on Extensively Discussed Topics on the Web
http://dl.acm.org/citation.cfm?id=1880857

Graph Data Management: Techniques and Applications [Hardcover]
Sherif Sakr (Author, Editor), Eric Pardede (Editor)

Accepted papers in GDM2012

A Comparison of Current Graph Database Models
Benchmarking Traversal Operations Over Graph Databases
Mining Associations Using Directed Hypergraphs
Design of Declarative Graph Query Languages: On the Choice between Value, Pattern and Object based Representations for Graphs
Finding Skyline Nodes in Large Networks (Invited Paper)
Partitioning Social Networks for Fast Retrieval of Time-dependent Queries (Invited Paper)
Will Graph Data Management Techniques Contribute to the Successful Large-Scale Deployment of Semantic Web Technologies?
Graph-based Knowledge Representation: Computational Foundations of Conceptual Graphs (Advanced Information and Knowledge Processing) [Paperback]

Wednesday, January 11, 2012

NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce

http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p334.pdf


In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the Message Passing Interface (MPI) have been the de facto choices for implementing these ML-DM algorithms. The MR programming paradigm has been of particular interest as it gracefully handles large datasets and has built-in resilience against failures. However, the existing parallel programming paradigms are too low-level and ill-suited for implementing ML-DM algorithms. To address this deficiency, we present NIMBLE, a portable infrastructure that has been specifically designed to enable the rapid implementation of parallel ML-DM algorithms. The infrastructure allows one to compose parallel ML-DM algorithms using reusable (serial and parallel) building blocks that can be efficiently executed using MR and other parallel programming models; it currently runs on top of Hadoop, which is an open-source MR implementation. We show how NIMBLE can be used to realize scalable implementations of ML-DM algorithms and present a performance evaluation.
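The abstract's key idea is composing ML-DM algorithms from reusable serial and parallel building blocks that are then executed as MapReduce jobs. A minimal in-memory sketch of that composition style (plain Python standing in for Hadoop; the frequent-item task and data are invented, not NIMBLE's actual API):

from collections import defaultdict

def run_map_reduce(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce pass."""
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    return {key: reducer(key, values) for key, values in intermediate.items()}

# Building block 1 (parallel): count item occurrences across transactions.
def count_mapper(transaction):
    for item in transaction:
        yield item, 1

def sum_reducer(key, values):
    return sum(values)

# Building block 2 (serial): keep only frequent items.
def frequent_items(counts, min_support):
    return {item: c for item, c in counts.items() if c >= min_support}

if __name__ == "__main__":
    transactions = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
    counts = run_map_reduce(transactions, count_mapper, sum_reducer)
    print(frequent_items(counts, min_support=2))   # {'a': 3, 'b': 3, 'c': 2}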

Saturday, January 7, 2012

Wikipedia SOM (Self Organizing Map)


https://kaigi.org/jsai/webprogram/2011/pdf/329.pdf


Wikipedia, an online encyclopedia that anyone can edit through a web browser, has a semi-structured data model and covers a wide range of fields with high comprehensiveness. Because of these characteristics it has been used as a corpus in many research areas, including artificial intelligence, natural language processing, and web mining. The information published on Wikipedia grows every day; across all languages there are now more than 18 million articles. As a result, it has become difficult to grasp the overall picture of Wikipedia, such as how much information exists in which fields and how those fields are connected to one another. Research that uses Wikipedia as a corpus often designs algorithms according to the characteristics of the data, so a bird's-eye view of the whole is important: what kinds of article sets exist and in what quantity, what clusters there are, and how the clusters relate to each other. A global overview is also considered important for ordinary users who browse and edit Wikipedia, for purposes such as identifying missing information and investigating relationships between fields. In this work, we apply MIGSOM [Nakayama 11], a self-organizing map algorithm inspired by neuronal cell migration, to Wikipedia and propose a method for surveying this global picture. MIGSOM is a technique for visualizing large-scale sparse data and building document maps. MIGSOM has two characteristics. One is that stable clustering performance can be expected even when it is applied to large-scale data. The other is that cluster analysis using a zoom function is possible, which enables analysis of both global and local clusters.
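MIGSOM itself is not described in enough detail here to reproduce; as a rough illustration of the underlying self-organizing-map training loop (find the best-matching unit, pull its grid neighborhood toward the sample), a minimal sketch with made-up document vectors and parameters:

import numpy as np

def train_som(data, grid_w=8, grid_h=8, iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Plain SOM: find best-matching unit, update its neighborhood toward the sample."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.random((grid_h, grid_w, dim))
    ys, xs = np.mgrid[0:grid_h, 0:grid_w]
    for t in range(iters):
        x = data[rng.integers(len(data))]
        dists = np.linalg.norm(weights - x, axis=2)
        by, bx = np.unravel_index(np.argmin(dists), dists.shape)   # best-matching unit
        lr = lr0 * (1 - t / iters)                                 # decaying learning rate
        sigma = sigma0 * (1 - t / iters) + 1e-3                    # shrinking neighborhood
        grid_dist2 = (ys - by) ** 2 + (xs - bx) ** 2
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))                 # neighborhood function
        weights += lr * h[:, :, None] * (x - weights)
    return weights

if __name__ == "__main__":
    docs = np.random.default_rng(1).random((200, 20))   # stand-in for sparse document vectors
    som = train_som(docs)
    print(som.shape)   # (8, 8, 20)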

Graph + Bioinformatics

Causal graph-based analysis of genome-wide association data in rheumatoid arthritis
http://www.biology-direct.com/content/6/1/25

Graph-based methods for analysing networks in cell biology
http://bib.oxfordjournals.org/content/7/3/243.full
Availability of large-scale experimental data for cell biology is enabling computational methods to systematically model the behaviour of cellular networks. This review surveys the recent advances in the field of graph-driven methods for analysing complex cellular networks. The methods are outlined on three levels of increasing complexity, ranging from methods that can characterize global or local structural properties of networks to methods that can detect groups of interconnected nodes, called motifs or clusters, potentially involved in common elementary biological functions. We also briefly summarize recent approaches to data integration and network inference through graph-based formalisms. Finally, we highlight some challenges in the field and offer our personal view of the key future trends and developments in graph-based analysis of large-scale datasets.

Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2912890/

The investigation of plant genome structure and evolution requires comprehensive characterization of repetitive sequences that make up the majority of higher plant nuclear DNA. Since genome-wide characterization of repetitive elements is complicated by their high abundance and diversity, novel approaches based on massively-parallel sequencing are being adapted to facilitate the analysis. It has recently been demonstrated that the low-pass genome sequencing provided by a single 454 sequencing reaction is sufficient to capture information about all major repeat families, thus providing the opportunity for efficient repeat investigation in a wide range of species. However, the development of appropriate data mining tools is required in order to fully utilize this sequencing data for repeat characterization.
Results
We adapted a graph-based approach for similarity-based partitioning of whole genome 454 sequence reads in order to build clusters made of the reads derived from individual repeat families. The information about cluster sizes was utilized for assessing the proportion and composition of repeats in the genomes of two model species, Pisum sativum and Glycine max, differing in genome size and 454 sequencing coverage. Moreover, statistical analysis and visual inspection of the topology of the cluster graphs using a newly developed program tool, SeqGrapheR, were shown to be helpful in distinguishing basic types of repeats and investigating sequence variability within repeat families.
Conclusions
Repetitive regions of plant genomes can be efficiently characterized by the presented graph-based analysis and the graph representation of repeats can be further used to assess the variability and evolutionary divergence of repeat families, discover and characterize novel elements, and aid in subsequent assembly of their consensus sequences.
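As a toy illustration of the graph step described above (connect sufficiently similar reads, take connected components as repeat clusters, read off cluster sizes), with an invented similarity function standing in for the all-to-all sequence comparison used in the paper:

import itertools
from collections import defaultdict

def similarity(a, b):
    """Toy stand-in for a sequence-similarity score (fraction of matching positions)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def cluster_reads(reads, threshold=0.8):
    """Connected components of the read-similarity graph; cluster sizes approximate repeat abundance."""
    parent = list(range(len(reads)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in itertools.combinations(range(len(reads)), 2):
        if similarity(reads[i], reads[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for i in range(len(reads)):
        clusters[find(i)].append(i)
    return sorted(clusters.values(), key=len, reverse=True)

if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "TTTTCCCG", "GGGGGGGG"]
    for c in cluster_reads(reads):
        print(len(c), c)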

Graph-based data mining for biological applications
https://lirias.kuleuven.be/bitstream/123456789/267094/1/phd_leander_schietgat_final_version.pdf

Graph + Medical Informatics

Graph-based signal integration for patient diagnosis discovery
http://www.biotconf.org/biot2011/BIOT-2011-EA-Herskovic-Bernstam.pdf

Biomedical information problems are often difficult to solve with computational techniques. One reason for this difficulty is that many qualitatively different information types must be combined to solve the problem. For example, diagnosing a disease requires that information such as patient demographics, past medical history, disease prevalence, and current complaints be integrated in some rational way. In this paper, we describe an approach that leverages graphs to integrate several information sources to “diagnose” a patient record.

A three-level graph-based model for the management of hospital information systems.
http://www.ncbi.nlm.nih.gov/pubmed/7476470
Information processing in hospitals, especially in university hospitals, is currently faced with two major issues: low-cost hardware and progress in networking technology lead to a further decentralization of computing capacity, and, due to the increasing need for information processing in hospitals and due to economic restrictions, it is necessary to use commercial software products. This leads to heterogeneous hospital information systems using a variety of software and hardware products, and to a stronger demand for integrating these products and, in general, for a dedicated methodology for the management of hospital information systems to support patient care and medical research. We present a three-level graph-based model (3LGM) to support the systematic management of hospital information systems. 3LGM can serve as a basis for assessing the quality of information processing in hospitals. 3LGM distinguishes between a procedural level for describing the information procedures (and their information interchange) of a hospital information system and thus its functionality, a logical tool level, focusing on application systems and communication links, and a physical tool level with physical subsystems (e.g., computer systems) and data transmission. The examples that are presented have been taken from the Heidelberg University Hospital Information System.

Integrated multimedia electronic patient record and graph-based image information for cerebral tumors
http://perso.telecom-paristech.fr/~bloch/papers/CBM-08.pdf
Current electronic patient record (EPR) implementations do not incorporate medical images, nor structural information extracted from them, despite images' increasing role in diagnosis. This paper presents a framework for integrating into EPRs anatomical and pathological knowledge extracted from segmented magnetic resonance imaging (MRI), applying a graph representation of anatomical and functional information for individual patients. Focusing on cerebral tumor examination and patient follow-up, multimedia EPRs were created and evaluated through a 3D navigation application, developed with open-source libraries and standards. Results suggest that the enhanced clinical information scheme could lead to original changes in the way medical experts utilize image-based information.


A Graph-based Bagging
http://www.cse.iitm.ac.in/CoLISD/Papers/4.pdf
The ensemble technique weights several individual classifiers and combines them to obtain a composite classifier that outperforms each of them alone. Although this technique has been successfully applied in many areas, in the literature review we did not find its application on networked data. To address this gap, we propose a bagging procedure applied to graph-based classifiers. Our contribution was to develop a bagging procedure for networked data which either maintains or significantly improves the performance of the tested relational classifiers. Additionally, we investigate the role played by diversity among the several individual classifiers to explain when the technique can be successfully applied.
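The paper's relational classifiers are not reproduced here; a minimal sketch of the bagging idea for networked data, with an invented neighbor-vote classifier and toy graph, assuming bootstrap samples are drawn over the labeled nodes:

import random
from collections import Counter

def neighbor_vote(node, labels, adjacency):
    """Toy relational classifier: majority label among a node's labeled neighbors."""
    votes = [labels[n] for n in adjacency.get(node, []) if n in labels]
    return Counter(votes).most_common(1)[0][0] if votes else None

def bagged_predict(node, labeled, adjacency, n_bags=25, seed=0):
    """Bagging: bootstrap the labeled nodes, classify per bag, then vote across bags."""
    rng = random.Random(seed)
    items = list(labeled.items())
    votes = []
    for _ in range(n_bags):
        bag = dict(rng.choices(items, k=len(items)))   # bootstrap sample of labeled nodes
        pred = neighbor_vote(node, bag, adjacency)
        if pred is not None:
            votes.append(pred)
    return Counter(votes).most_common(1)[0][0] if votes else None

if __name__ == "__main__":
    adjacency = {1: [2, 3], 2: [1, 3], 3: [1, 2], 4: [5], 5: [4, 3]}
    labeled = {2: "spam", 3: "spam", 4: "ham"}
    print(bagged_predict(1, labeled, adjacency))   # expected: "spam"
    print(bagged_predict(5, labeled, adjacency))   # neighbors 4 ("ham") and 3 ("spam")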

Graph-based learning for information systems
http://gradworks.umi.com/33/52/3352368.html


Design and Evaluation of a Temporal, Graph-based Language for Querying Collections of Patient Histories

Giving clinicians and researchers the ability to easily retrieve and explore relevant fragments of patient histories would greatly facilitate quality assurance, patient follow-up and research on patient treatment processes. Established database query languages are inconvenient for such exploration, and may also be too complex for users with limited backgrounds in informatics. We believe that understandability can be increased in return for a sacrifice of some of the power of expression found in general query languages. In order to design a specialized query language, we have collected and synthesized a tentative list of requirements. Based on these requirements, we have designed and implemented Practice Explorer, a prototype for visual query of collections of patient histories, and evaluated the understandability of its query language by testing with medical students. The results indicate that parts of the language are intuitive enough for users to understand without demonstrations, examples, feedback or assistance. They also provide some lessons for future work in this area.


Graph-based Word Sense Disambiguation of Biomedical Documents
http://bioinformatics.oxfordjournals.org/content/early/2010/10/07/bioinformatics.btq555.full.pdf
Word Sense Disambiguation (WSD), automatically identifying the meaning of ambiguous words in context, is an important stage of text processing. This paper presents a graph-based approach to WSD in the biomedical domain. The method is unsupervised and does not require any labeled training data. It makes use of knowledge from the Unified Medical Language System (UMLS) Metathesaurus which is represented as a graph. A state-of-the-art algorithm, Personalized PageRank, is used to perform WSD. Results: When evaluated on the NLM-WSD dataset the algorithm outperforms other methods that rely on the UMLS Metathesaurus alone.
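A minimal Personalized PageRank power iteration; the tiny concept graph and the way context seeds the personalization vector are simplified stand-ins for the UMLS Metathesaurus setup described above:

import numpy as np

def personalized_pagerank(adj, personalization, damping=0.85, iters=100):
    """Power iteration for Personalized PageRank; adj[i, j] = 1 means an edge j -> i."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=0)
    out_deg[out_deg == 0] = 1.0                      # avoid division by zero for sink nodes
    transition = adj / out_deg                       # column-stochastic transition matrix
    p = personalization / personalization.sum()
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = damping * transition @ rank + (1 - damping) * p
    return rank

if __name__ == "__main__":
    adj = np.array([[0, 1, 0, 1],
                    [1, 0, 1, 0],
                    [0, 1, 0, 0],
                    [1, 0, 1, 0]], dtype=float)
    seed = np.array([1.0, 0.0, 0.0, 1.0])            # context words seed concepts 0 and 3
    print(personalized_pagerank(adj, seed))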


Graph-Based Shape Analysis for MRI Classification
http://www.igi-global.com/article/international-journal-knowledge-discovery-bioinformatics/62299
Searching for correlations between brain structure and attributes of a person’s intellectual state is a process which may be better done by automation than by human labor. Such an automated system would be capable of performing classification based on the discovered correlation, which would be means of testing how accurate the discovered correlation is. The authors have developed a system which generates a graph-based representation of the shape of the third and lateral ventricles based on a structural MRI, and classifies images represented in this manner. The system is evaluated on accuracy at classifying individuals showing cognitive impairment to Alzheimer’s Disease. Classification accuracy is 74.2% when individuals with CDR 0.5 are included as impaired in a balanced dataset of 166 images, and 79.3% accuracy when differentiating individuals with CDR at least 1.0 and healthy individuals in a balanced dataset of 54 images. Finally, the system is used to classify MR images according to level of education, with 77.2% accuracy differentiating highly-educated individuals from those for whom no higher education is listed, in a balanced dataset of 178 images.


Graph Based Detection of Optic Disc and Fovea in Retinal Images
http://www.inf.unideb.hu/~hajdua/papers/hajduc_2010h.pdf

Diabetic retinopathy (DR) is the damage to the eye's retina that occurs with long-term diabetes, which can eventually lead to blindness. Screening programs for DR are being introduced; however, an important prerequisite for automation is the accurate localization of the main anatomical features in the image, notably the optic disc (OD) and the macula. A series of interesting algorithms have been proposed in the recent past and the performance is generally good, but each method has situations where it fails. This paper presents a novel framework for automatic detection of the optic disc and macula in retinal fundus images using a combination of different optic disc and macula detectors represented by a weighted complete graph. A node pruning procedure removes the worst vertices of the graph by satisfying predefined geometric constraints and gets the best possible detector outputs to be combined using a weighted average. Extensive tests have shown that combining the predictions of multiple detectors is more accurate than any of the individual detectors making up the ensemble.

Power Watershed: A Unifying Graph-Based Optimization Framework
http://www.computer.org/portal/web/csdl/doi/10.1109/TPAMI.2010.200

In this work, we extend a common framework for graph-based image segmentation that includes the graph cuts, random walker, and shortest path optimization algorithms. Viewing an image as a weighted graph, these algorithms can be expressed by means of a common energy function with differing choices of a parameter q acting as an exponent on the differences between neighboring nodes. Introducing a new parameter p that fixes a power for the edge weights allows us to also include the optimal spanning forest algorithm for watershed in this same framework. We then propose a new family of segmentation algorithms that fixes p to produce an optimal spanning forest but varies the power q beyond the usual watershed algorithm, which we term the power watershed. In particular, when q=2, the power watershed leads to a multilabel, scale and contrast invariant, unique global optimum obtained in practice in quasi-linear time. Placing the watershed algorithm in this energy minimization framework also opens new possibilities for using unary terms in traditional watershed segmentation and using watershed to optimize more general models of use in applications beyond image segmentation.
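Sketching the unifying energy described in the abstract, omitting the unary seed terms for brevity (a simplification, not the paper's exact formulation):

E_{p,q}(x) = \sum_{e_{ij} \in E} w_{ij}^{\,p} \, \lvert x_i - x_j \rvert^{q}

where the w_{ij} are edge weights derived from the image and x is the labeling/potential field; the exponent q selects among graph cuts, random walker and shortest paths, and letting p tend to infinity while varying q yields the power watershed family.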


MEDRank: using graph-based concept ranking to index biomedical texts.
http://www.mendeley.com/research/medrank-using-graphbased-concept-ranking-index-biomedical-texts-1/
As the volume of biomedical text increases exponentially, automatic indexing becomes increasingly important. However, existing approaches do not distinguish central (or core) concepts from concepts that were mentioned in passing. We focus on the problem of indexing MEDLINE records, a process that is currently performed by highly trained humans at the National Library of Medicine (NLM). NLM indexers are assisted by a system called the Medical Text Indexer (MTI) that suggests candidate indexing terms.


A semantic graph-based approach to biomedical summarisation
http://dl.acm.org/citation.cfm?id=2016307
Access to the vast body of research literature that is available in biomedicine and related fields may be improved by automatic summarisation. This paper presents a method for summarising biomedical scientific literature that takes into consideration the characteristics of the domain and the type of documents. Methods: To address the problem of identifying salient sentences in biomedical texts, concepts and relations derived from the Unified Medical Language System (UMLS) are arranged to construct a semantic graph that represents the document. A degree-based clustering algorithm is then used to identify different themes or topics within the text. Different heuristics for sentence selection, intended to generate different types of summaries, are tested. A real document case is drawn up to illustrate how the method works. Results: A large-scale evaluation is performed using the recall-oriented understudy for gisting-evaluation (ROUGE) metrics. The results are compared with those achieved by three well-known summarisers (two research prototypes and a commercial application) and two baselines. Our method significantly outperforms all summarisers and baselines. The best of our heuristics achieves an improvement in performance of almost 7.7 percentage units in the ROUGE-1 score over the LexRank summariser (0.7862 versus 0.7302). A qualitative analysis of the summaries also shows that our method succeeds in identifying sentences that cover the main topic of the document and also considers other secondary or ''satellite'' information that might be relevant to the user. Conclusion: The method proposed is proved to be an efficient approach to biomedical literature summarisation, which confirms that the use of concepts rather than terms can be very useful in automatic summarisation, especially when dealing with highly specialised domains.


Discovering Temporal Relations Using Graph Structures (September 15) (Special issue on "Active Mining" and general papers)
http://ci.nii.ac.jp/naid/110002664364
In knowledge discovery from medical data, the most difficult part is how to handle time-series data that arise irregularly. In this paper, we propose a rule discovery method based on first-order predicate logic for handling such data. The proposed method represents the time-series data as graphs and generates candidate predicates to be used in rules by rewriting that data according to certain rules. For evaluation, we ran experiments on real medical data and showed the effectiveness of the method.

PuReD-MCL: a graph-based PubMed document clustering methodology
http://bioinformatics.oxfordjournals.org/content/24/17/1935.full


A graph-based representation of Gene Expression profiles in DNA microarrays
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4675762
This paper proposes a new and very flexible data model, called gene expression graph (GEG), for gene expression analysis and classification. Three features differentiate GEGs from other available microarray data representation structures: (i) the memory occupation of a GEG is independent of the number of samples used to build it; (ii) a GEG more clearly expresses relationships among expressed and non-expressed genes in both healthy and diseased tissue experiments; (iii) GEGs allow to easily implement very efficient classifiers. The paper also presents a simple classifier for sample-based classification to show the flexibility and user-friendliness of the proposed data structure.

GOLEM: an interactive graph-based gene-ontology navigation and analysis tool
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1618863/
The Gene Ontology has become an extremely useful tool for the analysis of genomic data and structuring of biological knowledge. Several excellent software tools for navigating the gene ontology have been developed. However, no existing system provides an interactively expandable graph-based view of the gene ontology hierarchy. Furthermore, most existing tools are web-based or require an Internet connection, will not load local annotations files, and provide either analysis or visualization functionality, but not both.



http://www.biomedcentral.com/1471-2105/8/S9/S4/

Graph + Text Analysis

Plagiarism Detection Using Graph-Based Representation
http://www.mendeley.com/research/plagiarism-detection-using-graphbased-representation/

Plagiarism of material from the Internet is a widespread and growing problem. Several methods are used to detect plagiarism and the similarity between a source document and suspected documents, such as fingerprints based on characters or n-grams. In this paper, we discuss a new method to detect plagiarism based on a graph representation. Preprocessing is required for each document, such as breaking the document down into its constituent sentences, segmenting each sentence into separate terms, and removing stop words. We build the graph by grouping each sentence's terms in one node; the resulting nodes are connected to each other based on the order of sentences within the document, and all nodes in the graph are also connected to a top-level node, the "Topic Signature". The topic signature node is formed by extracting the concepts of each sentence's terms and grouping them in such a node. The main advantage of the proposed method is that the topic signature, which is the main entry point to the graph, is used as a quick guide to the relevant nodes that should be considered for the comparison between the source documents and the suspected one. We believe the proposed method can achieve good performance in terms of effectiveness and efficiency.



Possibilities of the Bipartite Graph Clustering in Text Analysis

http://ci.nii.ac.jp/naid/110004809727
For applying Markov clustering (MCL), and our own refinement of it, recurrent Markov clustering (RMCL), to text analysis, we propose building a bipartite graph from keyword/co-occurring-word pairs as an effective way to obtain the data, compare the results of MCL-RMCL bipartite graph clustering with conventional multivariate analysis based on the vector space model, and verify its effectiveness.
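The RMCL refinement and the keyword/co-occurrence bipartite construction are not reproduced here; a minimal sketch of plain Markov clustering (expansion then inflation on a column-stochastic matrix), on an invented graph:

import numpy as np

def mcl(adjacency, expansion=2, inflation=2.0, iters=50, self_loops=1.0):
    """Plain MCL: alternate matrix expansion (power) and inflation (elementwise power + renormalize)."""
    m = adjacency.astype(float) + self_loops * np.eye(len(adjacency))
    m = m / m.sum(axis=0)                                  # column-stochastic
    for _ in range(iters):
        m = np.linalg.matrix_power(m, expansion)           # expansion
        m = m ** inflation
        m = m / m.sum(axis=0)                              # inflation + renormalization
    clusters = {}
    for col in range(m.shape[1]):
        attractor = int(np.argmax(m[:, col]))              # crude attractor read-out
        clusters.setdefault(attractor, []).append(col)
    return list(clusters.values())

if __name__ == "__main__":
    # Two loosely connected groups: {0, 1, 2} and {3, 4}.
    a = np.array([[0, 1, 1, 0, 0],
                  [1, 0, 1, 0, 0],
                  [1, 1, 0, 1, 0],
                  [0, 0, 1, 0, 1],
                  [0, 0, 0, 1, 0]], dtype=float)
    print(mcl(a))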

Graph + CyberSecurity (2)

EigenDiagnostics: Spotting Connection Patterns and Outliers in Large Graphs
http://www.computer.org/portal/web/csdl/doi/10.1109/ICDMW.2010.203

In a large weighted graph, how can we detect suspicious subgraphs, patterns, and outliers? A suspicious pattern could be a near-clique or a set of nodes bridging two or more near-cliques. This would improve intrusion detection in computer networks and network traffic monitoring. Are there other network patterns that need to be detected? We propose EigenDiagnostics, a fast algorithm that spots such patterns. The process creates scatter-plots of the node properties (such as eigenscores, degree, and weighted degree), then looks for linear-like patterns. Our tool automatically discovers such plots, using the Hough transform from machine vision. We apply EigenDiagnostics on a wide variety of synthetic and real data (LBNL computer traffic, movie-actor data from IMDB, patent citations, and more). EigenDiagnostics finds surprising patterns. They appear to correspond to port-scanning (in computer networks), repetitive tasks with bot-net-like behavior, strange "bridges" in movie-actor data (due to actors changing careers, for example), and more. The advantages are: (a) it is effective in discovering surprising patterns, (b) it is fast (linear in the number of edges), (c) it is parameter-free, and (d) it is general, and applicable to many, diverse graphs, spanning tens of gigabytes.


Mining and Modeling Real Graphs: Patterns, Generators, Anomalies, and Tools
http://www.cs.cmu.edu/~lakoglu/proposal/lakoglu-proposal.pdf




Community-based anomaly detection in evolutionary networks
http://www.springerlink.com/content/b61165511117u863/
Networks of dynamic systems, including social networks, the World Wide Web, climate networks, and biological networks, can be highly clustered. Detecting clusters, or communities, in such dynamic networks is an emerging area of research; however, less work has been done in terms of detecting community-based anomalies. While there has been some previous work on detecting anomalies in graph-based data, none of these anomaly detection approaches have considered an important property of evolutionary networks—their community structure. In this work, we present an approach to uncover community-based anomalies in evolutionary networks characterized by overlapping communities. We develop a parameter-free and scalable algorithm using a proposed representative-based technique to detect all six possible types of community-based anomalies: grown, shrunken, merged, split, born, and vanished communities. We detail the underlying theory required to guarantee the correctness of the algorithm. We measure the performance of the community-based anomaly detection algorithm by comparison to a non–representative-based algorithm on synthetic networks, and our experiments on synthetic datasets show that our algorithm achieves a runtime speedup of 11–46 over the baseline algorithm. We have also applied our algorithm to two real-world evolutionary networks, Food Web and Enron Email. Significant and informative community-based anomaly dynamics have been detected in both cases.



Using Bayesian Networks for Cyber Security Analysis
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5544924
Capturing the uncertain aspects in cyber security is important for security analysis in enterprise networks. However, there has been insufficient effort in studying what modeling approaches correctly capture such uncertainty, and how to construct the models to make them useful in practice. In this paper, we present our work on justifying uncertainty modeling for cyber security, and initial evidence indicating that it is a useful approach. Our work is centered around near real-time security analysis such as intrusion response. We need to know what is really happening, the scope and severity level, possible consequences, and potential countermeasures. We report our current efforts on identifying the important types of uncertainty and on using Bayesian networks to capture them for enhanced security analysis. We build an example Bayesian network based on a current security graph model, justify our modeling approach through attack semantics and experimental study, and show that the resulting Bayesian network is not sensitive to parameter perturbation.


Applying Graph-Based Anomaly Detection Approaches to the Discovery of Insider Threats
http://eecs.wsu.edu/~holder/pubs/EberleISI09.pdf

The ability to mine data represented as a graph has become important in several domains for detecting various structural patterns. One important area of data mining is anomaly detection, but little work has been done in terms of detecting anomalies in graph-based data. In this paper we present graph-based approaches to uncovering anomalies in applications containing information representing possible insider threat activity: e-mail, cell-phone calls, and order processing.




Graph-based malware detection using dynamic analysis
http://www.mendeley.com/research/graphbased-malware-detection-using-dynamic-analysis/

Visualizing graph dynamics and similarity for enterprise network security and management
Managing complex enterprise networks requires an understanding at a finer granularity than traditional network monitoring. The ability to correlate and visualize the dynamics and inter-relationships among various network components such as hosts, users, and applications is non-trivial. In this paper, we propose a visualization approach based on the hierarchical structure of similarity/difference visualization in the context of heterogeneous graphs. The concept of hierarchical visualization starts with the evolution of inter-graph states, adapts to the visualization of intra-graph clustering, and concludes with the visualization of similarity between individual nodes. Our visualization tool, ENAVis (Enterprise Network Activities Visualization), quantifies and presents these important changes and dynamics essential to network operators through a visually appealing and highly interactive manner. Through novel graph construction and transformation, such as network connectivity graphs, MDS graphs, bipartite graphs, and similarity graphs, we demonstrate how similarity/dynamics can be effectively visualized to provide insight with regards to network understanding.



A Graph Similarity-based Approach to Security Event Analysis Using Correlation Techniques

http://dl.acm.org/citation.cfm?id=1850799
Detecting and identifying security events to provide cyber situation awareness has become an increasingly important task within the network research and development community. We propose a graph similarity-based approach to event detection and identification that integrates a number of techniques to collect time-varying situation information, extract correlations between event attributes, and characterize and identify security events. Diverging from the traditional rule- or statistical-based pattern matching techniques, the proposed mechanism represents security events in a graphical form of correlation networks and identifies security events through the computation of graph similarity measurements to eliminate the need for constructing user or system profiles. These technical components take fundamentally different approaches from traditional empirical or statistical methods and are designed based on rigorous computational analysis with mathematically proven performance guarantees. The performance superiority of the proposed mechanism is demonstrated by extensive simulation and experimental results.

Graph Based Statistical Analysis of Network Traffic
http://www.cs.purdue.edu/mlg2011/papers/paper_10.pdf

We propose a method for analyzing traffic data in large computer networks such as big enterprise networks or the Internet. Our approach combines a graph theoretical representation of the data and graph analysis with novel statistical methods for discovering patterns and time-related anomalies. We model the traffic as a graph and use temporal characteristics of the data in order to decompose it into subgraphs corresponding to individual sessions, whose characteristics are then analyzed using statistical methods. The goal of that analysis is to discover patterns in the network traffic data that might indicate intrusion activity or other malicious behavior.


Using Graph Theory to Detect Security Policy Violators
http://www.mawode.com/~waltman/talks/plug05emailgraph.pdf

Understanding Multistage Attacks by Attack-Track based Visualization of Heterogeneous Event Streams
http://www.cse.buffalo.edu/~shambhu/documents/pdf/vizsec01s-mathew.pdf

In this paper, we present a method of handling the visualization of heterogeneous event traffic that is generated by intrusion detection sensors, log files and other event sources on a computer network, from the point of view of detecting multistage attack paths that are of importance. We perform aggregation and correlation of these events based on their semantic content to generate Attack Tracks that are displayed to the analyst in real-time. Our tool, called the Event Correlation for Cyber-Attack Recognition System (ECCARS), enables the analyst to distinguish and separate an evolving multistage attack from the thousands of events generated on a network. We focus here on presenting the environment and framework for multistage attack detection using ECCARS along with screenshots that demonstrate its capabilities.

Scenario Graphs Applied to Network Security
http://www.cs.cmu.edu/~scenariograph/wing07.pdf

Traditional model checking produces one counterexample to illustrate a violation of a property by a model of the system. Some applications benefit from having all counterexamples, not just one. We call this set of counterexamples a scenario graph. In this chapter we present two different algorithms for producing scenario graphs and explain how scenario graphs are a natural representation for attack graphs used in the security community. Through a detailed concrete example, we show how we can model a computer network and generate and analyze attack graphs automatically. The attack graph we produce for a network model shows all ways in which an intruder can violate a given desired security property.

Network Security Evaluation through Attack Graph Generation
http://www.waset.org/journals/waset/v54/v54-73.pdf
In today's networks, security evaluation is a challenging task for most administrators. The typical means by which an attacker breaks into a network is through a series of exploits, where each exploit in the series satisfies the precondition for subsequent exploits and makes a causal relationship among them. Such a series of exploits constitutes an attack path, and the set of all possible attack paths forms an attack graph. Even well administered networks are susceptible to such attacks, since present day vulnerability scanners are only able to identify vulnerabilities in isolation, but there is a need for logical formalism and correlation among these vulnerabilities within a host or across multiple hosts to identify the overall risk of the network. In this paper we propose a network security analysis method based on the generation of a network attack graph. After analyzing network vulnerabilities, the linking relations between devices and the characteristics of attacks, the model of network security states is built, and the generating algorithm of the attack graph is implemented. Attack graphs are important tools for analyzing security vulnerabilities in enterprise networks. The experiment validates the evaluation method we proposed.
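A toy forward-search construction of an attack graph from exploits given as precondition/postcondition pairs, in the spirit of the abstract above; the exploit definitions and condition strings are invented:

# Each exploit: (name, preconditions, postconditions) over simple condition strings.
EXPLOITS = [
    ("ftp_rhosts", {"ftp(attacker,host1)"},   {"trust(host1,attacker)"}),
    ("rsh_login",  {"trust(host1,attacker)"}, {"user(attacker,host1)"}),
    ("local_bof",  {"user(attacker,host1)"},  {"root(attacker,host1)"}),
]

def build_attack_graph(initial_conditions, exploits):
    """Repeatedly apply every exploit whose preconditions hold; record edges exploit -> new condition."""
    known = set(initial_conditions)
    edges = []
    changed = True
    while changed:
        changed = False
        for name, pre, post in exploits:
            if pre <= known and not post <= known:
                for condition in post - known:
                    edges.append((name, condition))
                known |= post
                changed = True
    return known, edges

if __name__ == "__main__":
    goal = "root(attacker,host1)"
    reachable, edges = build_attack_graph({"ftp(attacker,host1)"}, EXPLOITS)
    print("goal reachable:", goal in reachable)
    for exploit, condition in edges:
        print(exploit, "->", condition)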

Graph + CyberSecurity

Making Cyber Security Decisions Through a Quantitative Metrics Approach
http://eden.dei.uc.pt/~mvieira/raci2011/Sanders_RACI_keynote.pdf


Least Effort Strategies for Cybersecurity
http://arxiv.org/pdf/cond-mat/0306002

Cybersecurity is an issue of increasing concern since the events of September 11. Many questions have been raised concerning the security of the Internet and the rest of the US's information infrastructure. This paper begins to examine the issue by analyzing the Internet's autonomous system (AS) map. Using the AS map, malicious infections are simulated and different defense strategies are considered in a cost-benefit framework. The results show that protecting the most connected nodes provides significant gains in security and that after the small minority of the most connected nodes are protected there are diminishing returns for further protection. However, if parts of the small minority of the most connected firms are not protected, such as non-US firms, protection levels are significantly decreased.

Graph-based Data Mining

Graph-based data mining
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=850825
Using databases represented as graphs, the Subdue system performs two key data mining techniques: unsupervised pattern discovery and supervised concept learning from examples. Applications to large structural databases demonstrate Subdue's scalability and effectiveness

CyberSecurity + Graph

Cyber Security Risks Assessment with Bayesian Defense Graphs and Architectural Models
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4755419

To facilitate rational decision making regarding cyber security investments, decision makers need to be able to assess expected losses before and after potential investments. This paper presents a model based assessment framework for analyzing the cyber security provided by different architectural scenarios. The framework uses the Bayesian statistics based extended influence diagrams to express attack graphs and related countermeasures. In this paper it is demonstrated how this structure can be captured in an abstract model to support analysis based on architectural models. The approach allows calculating the probability that attacks will succeed and the expected loss of these given the instantiated architectural scenario. Moreover, the framework can handle the uncertainties that are accompanied to the analyses. In architectural analysis there are uncertainties acquainted both to the scenario and its properties, as well as to the analysis framework that stipulates how security countermeasures contribute to cyber security.

Centrifuge 2.0 - Cyber Security Analysis - Identify Insights with Relationship Graphs
http://www.youtube.com/watch?v=8kJk0mBh8sg
http://www.youtube.com/watch?v=eZqxRGBwIO0&feature=related


Attack Graphs for Proactive Digital Forensics



A Mathematical Basis for Science-Based Cybersecurity

Mining Graph Patterns Efficiently via Randomized Summaries
http://www.cs.uiuc.edu/~hanj/pdf/vldb09_cchen.pdf

Graphs are prevalent in many domains such as bioinformatics, social networks, the Web and cyber-security. Graph pattern mining has become an important tool in the management and analysis of complexly structured data, where example applications include indexing, clustering and classification. Existing graph mining algorithms have achieved great success by exploiting various properties in the pattern space. Unfortunately, due to the fundamental role subgraph isomorphism plays in these methods, they may all enter into a pitfall when the cost to enumerate a huge set of isomorphic embeddings blows up, especially in large graphs. The solution we propose for this problem resorts to reduction on the data space. For each graph, we build a summary of it and mine this shrunk graph instead. Compared to other data reduction techniques that either reduce the number of transactions or compress between transactions, this new framework, called Summarize-Mine, suggests a third path by compressing within transactions. Summarize-Mine is effective in cutting down the size of graphs, thus decreasing the embedding enumeration cost. However, compression might lose patterns at the same time. We address this issue by generating randomized summaries and repeating the process for multiple rounds, where the main idea is that true patterns are unlikely to be missed in all rounds. We provide strict probabilistic guarantees on pattern loss likelihood. Experiments on real malware trace data show that Summarize-Mine is very efficient, and can find interesting malware fingerprints that were not revealed previously.


Cyber Security Link Analysis Graph
http://www.analyticbridge.com/photo/cyber-security-link-analysis
This visualization shows interesting patterns of behavior for recent network login traffic. The linkages are between source and destination IPs. The circular stars show one-to-one relationships representing normal behavior. But the unusual pattern in the lower center shows a destination IP under attack -- it has over one hundred source IPs sending it traffic.
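A trivial textual counterpart of the pattern the visualization highlights (a destination IP receiving traffic from an unusually large number of distinct sources), with made-up flow records:

from collections import defaultdict

def fan_in(flows):
    """Count distinct source IPs per destination IP from (src, dst) flow records."""
    sources = defaultdict(set)
    for src, dst in flows:
        sources[dst].add(src)
    return {dst: len(srcs) for dst, srcs in sources.items()}

def flag_targets(flows, threshold=100):
    """Flag destinations whose distinct-source count crosses the threshold."""
    return [dst for dst, n in fan_in(flows).items() if n >= threshold]

if __name__ == "__main__":
    flows = [(f"10.0.0.{i}", "192.168.1.5") for i in range(1, 121)]   # 120 distinct sources
    flows += [("10.0.1.1", "192.168.1.9"), ("10.0.1.2", "192.168.1.9")]
    print(flag_targets(flows))   # ['192.168.1.5']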


Oil Companies Stepping up Cyber Security as Hacking Attacks Increase
http://oilprice.com/Energy/Energy-General/Oil-Companies-Stepping-up-Cyber-Security-as-Hacking-Attacks-Increase.html


An Attack Graph Based Approach for Threat Identification of an Enterprise Network
http://www.igi-global.com/chapter/cyber-security-global-information-assurance/7409

The science of cyber security experimentation: the DETER project
http://dl.acm.org/citation.cfm?id=2076752
Since 2004, the DETER Cyber-security Project has worked to create an evolving infrastructure - facilities, tools, and processes - to provide a national resource for experimentation in cyber security. Building on our insights into requirements for cyber science and on lessons learned through 8 years of operation, we have made several transformative advances towards creating the next generation of DeterLab. These advances in experiment design and research methodology are yielding progressive improvements not only in experiment scale, complexity, diversity, and repeatability, but also in the ability of researchers to leverage prior experimental efforts of other researchers in the DeterLab user community. This paper describes the advances resulting in a new experimentation science and a transformed facility for cybersecurity research development and evaluation.


Blog Data Mining for Cyber Security Threats
http://ewinarko.staff.ugm.ac.id/datamining/tugas2/09-app-blogmining-cybersec.pdf
Blog data mining is a growing research area that addresses the domain-specific problem of extracting information from blog data. In our work, we analyzed blogs for various categories of cyber threats related to the detection of security threats and cyber crime. We have extended the Author-Topic model based on Latent Dirichlet Allocation to identify patterns of similarities in keywords and dates distributed across blog documents. From this model, we visualized the content and date similarities using the Isomap dimensionality reduction technique. Our findings support the theory that our probabilistic blog model can present the blogosphere in terms of topics with measurable keywords, hence aiding the investigative processes to understand and respond to critical cyber security events and threats.


Insider Threat Detection Using Graph-Based Approach
http://eecs.wsu.edu/~holder/pubs/EberleCATCH09.pdf

Protecting our nation's cyber infrastructure and securing sensitive information are critical challenges for homeland security and require the research, development and deployment of new technologies that can be transitioned into the field for combating cyber security risks. Particular areas of concern are the deliberate and intended actions associated with malicious exploitation, theft or destruction of data, or the compromise of networks, communications or other IT resources, of which the most harmful and difficult to detect threats are those propagated by an insider. However, current efforts to identify unauthorized access to information, such as what is found in document control and management systems, are limited in scope and capabilities.

In order to address this issue, this effort involves performing further research and development on the existing Graph-Based Anomaly Detection (GBAD) system [3]. GBAD discovers anomalous instances of structural patterns in data that represent entities, relationships and actions. Input to GBAD is a labeled graph in which entities are represented by labeled vertices and relationships or actions are represented by labeled edges between entities. Using the minimum description length (MDL) principle to identify the normative pattern that minimizes the number of bits needed to describe the input graph after being compressed by the pattern, GBAD implements algorithms for identifying the three possible changes to a graph: modifications, insertions and deletions. Each algorithm discovers those substructures that match the closest to the normative pattern without matching exactly. As a result, GBAD is looking for those activities that appear to match normal (or legitimate) transactions, but in fact are structurally different.

As a solution to the problem of insider threat detection, we will apply GBAD to datasets that represent the flow of information between entities, as well as the actions that take place on the information. This research involves the representation of datasets, like a document control and management system, as a graph, enhancement of GBAD's performance levels, and evaluation of GBAD on these datasets. In previous research, GBAD has already achieved over 95% accuracy detecting anomalies in simulated domains, with minimal false positives, on graphs of up to 100,000 vertices.

Twitter + Lifelog

A location predictor based on dependencies between multiple lifelog data
http://dl.acm.org/citation.cfm?id=1867702


Towards trajectory-based experience sharing in a city
http://dl.acm.org/citation.cfm?id=2063221

Twitter as Economic Indicators


http://dealmakersguide.com/twitter-as-an-economic-indicator/
http://professional.wsj.com/article/SB10001424052970204138204576598942105167646.html


Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena
http://arxiv.org/pdf/0911.1583

Twitter

Twitter under the microscope - First Monday

http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2317/206
Scholars, advertisers and political activists see massive online social networks as a representation of social interactions that can be used to study the propagation of ideas, social bond dynamics and viral marketing, among others. But the linked structures of social networks do not reveal actual interactions among people. Scarcity of attention and the daily rhythms of life and work make people default to interacting with those few that matter and that reciprocate their attention. A study of social interactions within Twitter reveals that the driver of usage is a sparse and hidden network of connections underlying the “declared” set of friends and followers.

Twitter + Influencer (Retweet)

Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network
http://www2.parc.com/isl/members/hong/publications/socialcomputing2010.pdf

Retweeting is the key mechanism for information diffusion in Twitter. It emerged as a simple yet powerful way of disseminating useful information. Even though a lot of information is shared via its social network structure in Twitter, little is known yet about how and why certain information spreads more widely than others. In this paper, we examine a number of features that might affect retweetability of tweets. We gathered content and contextual features from 74M tweets and used this data set to identify factors that are significantly associated with retweet rate. We also built a predictive retweet model. We found that, amongst content features, URLs and hashtags have strong relationships with retweetability. Amongst contextual features, the number of followers and followees as well as the age of the account seem to affect retweetability, while, interestingly, the number of past tweets does not predict retweetability of a user's tweet. We believe that this research would inform the design of sensemaking tools for social media streams.
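The paper's actual predictive model is not reproduced; a rough stand-in using logistic regression over a few of the features named above (has_url, has_hashtag, log-followers, account age), with invented training data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented training data: [has_url, has_hashtag, log10(followers), account_age_years]
X = np.array([
    [1, 1, 4.0, 3.0],
    [1, 0, 3.5, 2.0],
    [0, 1, 2.0, 1.0],
    [0, 0, 1.5, 0.5],
    [1, 1, 2.5, 4.0],
    [0, 0, 3.0, 0.2],
])
y = np.array([1, 1, 0, 0, 1, 0])   # 1 = was retweeted (made up)

model = LogisticRegression().fit(X, y)
new_tweet = np.array([[1, 0, 2.8, 1.5]])
print(model.predict_proba(new_tweet)[0, 1])   # estimated retweet probability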

Identifying Influencers on Twitter
http://thenoisychannel.com/2011/04/16/identifying-influencers-on-twitter/

More Twitter Analysis: Influencers Don't Retweet
http://www.readwriteweb.com/archives/more_twitter_analysis_influencers_dont_retweet.php

Measuring User Influence in Twitter: The Million Follower Fallacy
http://an.kaist.ac.kr/~mycha/docs/icwsm2010_cha.pdf


Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter
http://research.microsoft.com/pubs/102168/TweetTweetRetweet.pdf

Twitter + Disasters

Twitter in Disaster Mode:
http://people.ee.ethz.ch/~hossmath/papers/woid_twimight.pdf

Recent natural disasters (earthquakes, floods, etc.) have shown that people heavily use platforms like Twitter to communicate and organize in emergencies. However, the fixed infrastructure supporting such communications may be temporarily wiped out. In such situations, the phones' capabilities of infrastructure-less communication can fill in: by propagating data opportunistically (from phone to phone), tweets can still be spread, yet at the cost of delays. In this paper, we present Twimight and its network security extensions. Twimight is an open source Twitter client for Android phones featured with a "disaster mode", which users enable upon losing connectivity. In the disaster mode, tweets are not sent to the Twitter server but stored on the phone, carried around as people move, and forwarded via Bluetooth when in proximity with other phones. However, switching from an online centralized application to a distributed and delay-tolerant service relying on opportunistic communication requires rethinking the security architecture. We propose security extensions to offer comparable security in the disaster mode as in the normal mode to protect Twimight from basic attacks. We also propose a simple, yet efficient, anti-spam scheme to avoid users from being flooded with spam. Finally, we present a preliminary empirical performance evaluation of Twimight.

Information credibility on Twitter

Information Credibility on Twitter
http://www.www2011india.com/proceeding/proceedings/p675.pdf


We analyze the information credibility of news propagated through Twitter, a popular microblogging service. Previous research has shown that most of the messages posted on Twitter are truthful, but the service is also used to spread misinformation and false rumors, often unintentionally. In this paper we focus on automatic methods for assessing the credibility of a given set of tweets. Specifically, we analyze microblog postings related to "trending" topics, and classify them as credible or not credible, based on features extracted from them. We use features from users' posting and re-posting ("re-tweeting") behavior, from the text of the posts, and from citations to external sources. We evaluate our methods using a significant number of human assessments about the credibility of items on a recent sample of Twitter postings. Our results show that there are measurable differences in the way messages propagate, which can be used to classify them automatically as credible or not credible, with precision and recall in the range of 70% to 80%.

Twitter + Event Detection (2)

TwitterMonitor: Trend Detection over the Twitter Stream
http://queens.db.toronto.edu/~mathiou//TwitterMonitor.pdf

We present TwitterMonitor, a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e. 'trends') on Twitter in real time and provides meaningful analytics that synthesize an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and submitting their own description for each trend. We discuss the motivation for trend detection over social media streams and the challenges that lie therein. We then describe our approach to trend detection, as well as the architecture of TwitterMonitor. Finally, we lay out our demonstration scenario.
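TwitterMonitor's bursty-keyword detection is more involved than this; a toy sliding-window version that flags terms whose current-window frequency far exceeds their historical rate, over made-up tweets:

from collections import Counter

def bursty_terms(history_tweets, window_tweets, min_count=3, ratio=5.0):
    """Flag terms whose frequency in the current window is much higher than in the history."""
    hist = Counter(w for t in history_tweets for w in t.lower().split())
    curr = Counter(w for t in window_tweets for w in t.lower().split())
    hist_total = max(sum(hist.values()), 1)
    curr_total = max(sum(curr.values()), 1)
    out = []
    for term, c in curr.items():
        hist_rate = hist.get(term, 0.5) / hist_total        # 0.5 = smoothing for unseen terms
        curr_rate = c / curr_total
        if c >= min_count and curr_rate / hist_rate >= ratio:
            out.append((term, c))
    return sorted(out, key=lambda x: -x[1])

if __name__ == "__main__":
    history = ["good morning everyone", "coffee time", "traffic is bad today"] * 50
    window = ["earthquake in the city", "earthquake felt downtown", "huge earthquake just now"]
    print(bursty_terms(history, window))   # expected to flag "earthquake"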

X-RIME: Hadoop based large scale social network analysis

http://xrime.sourceforge.net/


Today's Internet-based social network sites possess huge user communities. They hold large amounts of data about their users and want to generate core competency from the data. A key enabler for this is a cost-efficient solution for social data management and social network analysis (SNA). Such a solution faces a few challenges. The most important one is that the solution should be able to handle massive and heterogeneous data sets. Facing this challenge, the traditional data warehouse based solutions are usually not cost efficient enough. On the other hand, existing SNA tools are mostly used in single workstation mode, and are not scalable enough. To this end, low cost and highly scalable data management and processing technologies from the cloud computing community should be brought in to help. However, most existing cloud based data analysis solutions try to provide SQL-like general purpose query languages, and do not directly support social network analysis. This makes them hard to optimize and hard to use for SNA users. To fix this gap, we came up with X-RIME. Briefly speaking, X-RIME wants to provide a few value-added layers on top of existing cloud infrastructure, to support smart decision loops based on massive data sets and SNA. To end users, X-RIME is a library consisting of Map-Reduce programs, which are used to do raw data pre-processing, transformation, SNA metrics and structures calculation, and graph / network visualization. The library could be integrated with other Hadoop based data warehouses (e.g., HIVE) to build more comprehensive solutions.

This project is a joint effort of Beijing University of Posts and Telecommunications (BUPT) and IBM China Research Lab, supported by the IBM Open Collaboration Research program. The code is open source under Apache License V2.0.
Currently Supported SNA Metrics and Structures

vertex degree statistics
weakly connected components (WCC)
strongly connected components (SCC)
bi-connected components (BCC)
ego-centric network density
breadth first search / single source shortest path (BFS/SSSP)
K-core
maximal cliques
pagerank
hyperlink-induced topic search (HITS)
minimal spanning tree (MST)
grid variant of Fruchterman-Reingold network layout algorithm
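As one concrete example from the list above, a single-machine K-core computation by repeated peeling; X-RIME itself runs such metrics as Map-Reduce jobs on Hadoop, which this sketch does not attempt, and the graph is invented:

def k_core(adjacency, k):
    """Repeatedly remove vertices of degree < k; the survivors form the k-core.
    Assumes an undirected graph given as a symmetric adjacency dict."""
    adj = {v: set(ns) for v, ns in adjacency.items()}
    changed = True
    while changed:
        changed = False
        for v in list(adj):
            if len(adj[v]) < k:
                for u in adj[v]:
                    adj[u].discard(v)
                del adj[v]
                changed = True
    return set(adj)

if __name__ == "__main__":
    g = {
        "a": {"b", "c", "d"}, "b": {"a", "c", "d"},
        "c": {"a", "b", "d"}, "d": {"a", "b", "c", "e"},
        "e": {"d", "f"}, "f": {"e"},
    }
    print(k_core(g, 3))   # {'a', 'b', 'c', 'd'}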

ZAME: Interactive Large-Scale Graph Visualization

http://www.aviz.fr/~fekete/ps/zame.pdf


We present the Zoomable Adjacency Matrix Explorer (ZAME), a visualization tool for exploring graphs at a scale of millions of nodes and edges. ZAME is based on an adjacency matrix graph representation aggregated at multiple scales. It allows analysts to explore a graph at many levels, zooming and panning with interactive performance from an overview to the most detailed views. Several components work together in the ZAME tool to make this possible. Efficient matrix ordering algorithms group related elements. Individual data cases are aggregated into higher-order meta-representations. Aggregates are arranged into a pyramid hierarchy that allows for on-demand paging to GPU shader programs to support smooth multiscale browsing. Using ZAME, we are able to explore the entire French Wikipedia (over 500,000 articles and 6,000,000 links) with interactive performance on standard consumer-level computer hardware.
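A toy version of the multiscale aggregation idea: sum-pool an adjacency matrix into coarser and coarser blocks to build the kind of pyramid ZAME pages to the GPU (the ordering, rendering and paging machinery are not reproduced):

import numpy as np

def aggregate(adjacency, factor=2):
    """Sum-pool an n x n adjacency matrix into (n/factor) x (n/factor) blocks."""
    n = adjacency.shape[0]
    m = n // factor
    trimmed = adjacency[:m * factor, :m * factor]
    return trimmed.reshape(m, factor, m, factor).sum(axis=(1, 3))

def pyramid(adjacency, factor=2, min_size=2):
    """Level 0 is the full matrix; each further level is a coarser aggregate."""
    levels = [adjacency]
    while levels[-1].shape[0] // factor >= min_size:
        levels.append(aggregate(levels[-1], factor))
    return levels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = (rng.random((16, 16)) < 0.1).astype(int)
    for level in pyramid(a):
        print(level.shape)   # (16, 16), (8, 8), (4, 4), (2, 2)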

Applications of graph technology to biological networks

[GEXF] Diseasome: a network of disorders and disease genes linked by known disorder-gene associations, indicating the common genetic origin of many diseases. Genes associated with similar disorders show both a higher likelihood of physical interactions between their products and higher similarity of their transcription expression profiles. The original dataset is available here. The Human Disease Network, Goh K-I, Cusick ME, Valle D, Childs B, Vidal M, Barabási A-L (2007), Proc Natl Acad Sci USA 104:8685-8690
[GEXF] C. Elegans neural network: a weighted, directed network representing the neural network of C. Elegans (a species of nematode). The data, compiled by D. Watts and S. Strogatz, can be obtained from the Web here. When citing, please reference: D. J. Watts and S. H. Strogatz, Nature 393, 440-442 (1998). Original experimental data taken from: J. G. White, E. Southgate, J. N. Thompson, and S. Brenner, Phil. Trans. R. Soc. London 314, 1-340 (1986).
[GEXF] Yeast: protein-protein interactions in yeast. The original data is available here.
[Translator's note: part of the original text appears to be missing.] The original data is available here.

Applications of graph techniques to code analysis

http://oss.infoscience.co.jp/gephi/wiki.gephi.org/index.php/Datasets.html
[GEXF] Java code: source code structure of a Java program. By S. Heymann & J. Palmier, 2008.
[GEXF] Dynamic Java code: dynamic source code structure of a Java program as commits progress in SVN. By S. Heymann & J. Bilcke, 2008.
[TGZ] GitHub open-source developers: see http://lumberjaph.net/blog/index.php/2010/03/25/github-explorer/.

Smart grid

http://oss.infoscience.co.jp/gephi/wiki.gephi.org/index.php/Datasets.html
Power grid: an unweighted, undirected network representing the topology of the Western States Power Grid of the United States. The data, compiled by D. Watts and S. Strogatz, are available on the Web here. Please cite: D. J. Watts and S. H. Strogatz, Nature 393, 440-442 (1998).
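Any of the GEXF datasets listed above can be inspected quickly with NetworkX. A minimal example follows; the file name "power.gexf" is a placeholder for whichever file you actually download.

```python
# Load one of the GEXF datasets above and print basic statistics.
# "power.gexf" is a placeholder file name for the downloaded dataset.
import networkx as nx

G = nx.read_gexf("power.gexf")            # e.g. the Watts-Strogatz power grid
print(G.number_of_nodes(), G.number_of_edges())

degrees = [d for _, d in G.degree()]
print("average degree:", sum(degrees) / len(degrees))
```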

Gephi

http://oss.infoscience.co.jp/gephi/gephi.org/plugins/graph-streaming/index.html
http://gephi.org/blog/

The purpose of the Graph Streaming API is to build a unified framework for streaming graph objects. Gephi’s data structure and visualization engine has been built with the idea that a graph is not static and might change continuously. By connecting Gephi with external data-sources, we leverage its power to visualize and monitor complex systems or enterprise data in real-time. Moreover, the idea of streaming graph data goes beyond Gephi, and a unified and standardized API could bring interoperability with other available tools for graph and network analysis, as they could start to interoperate with other tools in a distributed and cooperative fashion.
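As a rough sketch of how an external data source could push graph events to Gephi, the snippet below POSTs JSON events to a running Graph Streaming master. The endpoint URL and the event format ("an" = add node, "ae" = add edge) are assumptions based on the plugin's documented defaults; verify them against your Gephi and plugin versions before relying on them.

```python
# Rough sketch of streaming a graph into Gephi via the Graph Streaming
# plugin. Endpoint and JSON event names ("an", "ae") are assumed defaults;
# check the plugin documentation for your version.
import json
import urllib.request

MASTER = "http://localhost:8080/workspace0?operation=updateGraph"

def send(event: dict) -> None:
    req = urllib.request.Request(MASTER, data=json.dumps(event).encode())
    urllib.request.urlopen(req).read()

send({"an": {"a": {"label": "Node A"}, "b": {"label": "Node B"}}})
send({"ae": {"e1": {"source": "a", "target": "b", "directed": False}}})
```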

Data Set is available at
http://oss.infoscience.co.jp/gephi/wiki.gephi.org/index.php/Datasets.html

Twitter +

Twitter Power: Tweets as Electronic Word of Mouth
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.155.3321

In this paper we report research results investigating microblogging as a form of electronic word-of-mouth for sharing consumer opinions concerning brands. We analyzed more than 150,000 microblog postings containing branding comments, sentiments, and opinions. We investigated the overall structure of these microblog postings, the types of expressions, and the movement in positive or negative sentiment. We compared automated methods of classifying sentiment in these microblogs with manual coding. Using a case study approach, we analyzed the range, frequency, timing, and content of tweets in a corporate account. Our research findings show that 19% of microblogs contain mention of a brand. Of the branding microblogs, nearly 20% contained some expression of brand sentiments. Of these, more than 50% were positive and 33% were critical of the company or product. Our comparison of automated and manual coding showed no significant differences between the two approaches. In analyzing microblogs for structure and composition, the linguistic structure of tweets approximates the linguistic patterns of natural language expressions. We find that microblogging is an online tool for customer word-of-mouth communications and discuss the implications for corporations using microblogging as part of their overall marketing strategy.

Twitter + sentiment analysis (including applications to finance)

Twitter mood predicts the stock market

http://arxiv.org/abs/1010.3003

Behavioral economics tells us that emotions can profoundly affect individual behavior and decision-making. Does this also apply to societies at large, i.e., can societies experience mood states that affect their collective decision making? By extension is the public mood correlated or even predictive of economic indicators? Here we investigate whether measurements of collective mood states derived from large-scale Twitter feeds are correlated to the value of the Dow Jones Industrial Average (DJIA) over time. We analyze the text content of daily Twitter feeds by two mood tracking tools, namely OpinionFinder that measures positive vs. negative mood and Google-Profile of Mood States (GPOMS) that measures mood in terms of 6 dimensions (Calm, Alert, Sure, Vital, Kind, and Happy). We cross-validate the resulting mood time series by comparing their ability to detect the public's response to the presidential election and Thanksgiving day in 2008. A Granger causality analysis and a Self-Organizing Fuzzy Neural Network are then used to investigate the hypothesis that public mood states, as measured by the OpinionFinder and GPOMS mood time series, are predictive of changes in DJIA closing values. Our results indicate that the accuracy of DJIA predictions can be significantly improved by the inclusion of specific public mood dimensions but not others. We find an accuracy of 87.6% in predicting the daily up and down changes in the closing values of the DJIA and a reduction of the Mean Average Percentage Error by more than 6%.
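The key statistical step in the abstract above is a Granger-causality test of mood against DJIA movements. The sketch below shows that step with statsmodels on made-up placeholder series; the paper itself uses GPOMS "Calm" scores and DJIA closing values.

```python
# Hedged sketch of a Granger-causality test: does a daily mood series help
# predict DJIA returns? Data here are random placeholders, not real data.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(1)
mood = rng.normal(size=300)                        # e.g. daily "Calm" score
returns = np.roll(mood, 3) * 0.5 + rng.normal(scale=0.5, size=300)

# grangercausalitytests checks whether column 2 Granger-causes column 1.
data = np.column_stack([returns, mood])
results = grangercausalitytests(data, maxlag=5, verbose=False)
for lag, res in results.items():
    print(lag, round(res[0]["ssr_ftest"][1], 4))   # p-value at each lag
```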



Younggue Bae , Hongchul Lee, A sentiment analysis of audiences on twitter: who is the positive or negative audience of popular twitterers?, Proceedings of the 5th international conference on Convergence and hybrid information technology, September 22-24, 2011, Daejeon, Korea
http://dl.acm.org/citation.cfm?id=1944594&CFID=76892041&CFTOKEN=90820401

Automated identification of diverse sentiment types can be beneficial for many NLP systems such as review summarization and public media analysis. In some of these systems there is an option of assigning a sentiment value to a single sentence or a very short text. In this paper we propose a supervised sentiment classification framework which is based on data from Twitter, a popular microblogging service. By utilizing 50 Twitter tags and 15 smileys as sentiment labels, this framework avoids the need for labor-intensive manual annotation, allowing identification and classification of diverse sentiment types of short texts. We evaluate the contribution of different feature types for sentiment classification and show that our framework successfully identifies sentiment types of untagged sentences. The quality of the sentiment identification was also confirmed by human judges. We also explore dependencies and overlap between different sentiment types represented by smileys and Twitter hashtags.
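A toy version of the distant-supervision idea described above: treat smileys (or hashtags) as noisy sentiment labels, strip them from the text, and train an ordinary bag-of-words classifier. The tweets below are invented placeholders; real experiments would use the tagged corpus.

```python
# Toy distant-supervision sentiment classifier: smileys act as labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    ("great game tonight :)", ":)"),
    ("this phone is awful :(", ":("),
    ("love the new album :)", ":)"),
    ("stuck in traffic again :(", ":("),
]
texts = [t.replace(label, "").strip() for t, label in tweets]
labels = [label for _, label in tweets]   # the smiley is the (noisy) label

clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["awful traffic"]))
```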


Effective sentiment stream analysis with self-augmenting training and demand-driven projection
http://dl.acm.org/citation.cfm?id=2009981&CFID=76892041&CFTOKEN=90820401
How do we analyze sentiments over a set of opinionated Twitter messages? This issue has been widely studied in recent years, with a prominent approach being based on the application of classification techniques. Basically, messages are classified according to the implicit attitude of the writer with respect to a query term. A major concern, however, is that Twitter (and other media channels) follows the data stream model, and thus the classifier must operate with limited resources, including labeled data for training classification models. This imposes serious challenges for current classification techniques, since they need to be constantly fed with fresh training messages, in order to track sentiment drift and to provide up-to-date sentiment analysis.

We propose solutions to this problem. The heart of our approach is a training augmentation procedure which takes as input a small training seed, and then it automatically incorporates new relevant messages to the training data. Classification models are produced on-the-fly using association rules, which are kept up-to-date in an incremental fashion, so that at any given time the model properly reflects the sentiments in the event being analyzed. In order to track sentiment drift, training messages are projected on a demand driven basis, according to the content of the message being classified. Projecting the training data offers a series of advantages, including the ability to quickly detect trending information emerging in the stream. We performed the analysis of major events in 2010, and we show that the prediction performance remains about the same, or even increases, as the stream passes and new training messages are acquired. This result holds for different languages, even in cases where sentiment distribution changes over time, or in cases where the initial training seed is rather small. We derive lower-bounds for prediction performance, and we show that our approach is extremely effective under diverse learning scenarios, providing gains that range from 7% to 58%.
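The paper builds its classification models from association rules; as a stand-in, the sketch below shows only the self-augmenting loop itself, using a simple Naive Bayes text classifier: confident predictions on the incoming stream are folded back into the training set so the model can follow sentiment drift. All messages are invented placeholders.

```python
# Sketch of a self-augmenting training loop on a message stream.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

seed_texts = ["love it", "great match", "hate this", "so disappointing"]
seed_labels = ["pos", "pos", "neg", "neg"]
stream = ["really love the match", "this is disappointing", "great goal"]

texts, labels = list(seed_texts), list(seed_labels)
for message in stream:
    clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
    probs = clf.predict_proba([message])[0]
    label = clf.classes_[probs.argmax()]
    print(message, "->", label, round(probs.max(), 2))
    if probs.max() > 0.7:          # only fold back confident predictions
        texts.append(message)
        labels.append(label)
```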


Taketoshi Ushiama , Tomoya Eguchi, An information recommendation agent on microblogging service, Proceedings of the 5th KES international conference on Agent and multi-agent systems: technologies and applications, June 29-July 01, 2011, Manchester, UK

Bernard J. Jansen , Kate Sobel , Geoff Cook, Gen X and Ys attitudes on using social media platforms for opinion sharing, Proceedings of the 28th of the international conference extended abstracts on Human factors in computing systems, April 10-15, 2010, Atlanta, Georgia, USA

Tiffany C. Chao, Data repositories: a home for microblog archives?, Proceedings of the 2011 iConference, p.655-656, February 08-11, 2011, Seattle, Washington

Wei Wu , Bin Zhang , Mari Ostendorf, Automatic generation of personalized annotation tags for Twitter users, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, p.689-692, June 02-04, 2010, Los Angeles, California

Daniel M. Romero , Wojciech Galuba , Sitaram Asur , Bernardo A. Huberman, Influence and passivity in social media, Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases, September 05-09, 2011, Athens, Greece

Younggue Bae , Hongchul Lee, A sentiment analysis of audiences on twitter: who is the positive or negative audience of popular twitterers?, Proceedings of the 5th international conference on Convergence and hybrid information technology, September 22-24, 2011, Daejeon, Korea

Kamran Massoudi , Manos Tsagkias , Maarten de Rijke , Wouter Weerkamp, Incorporating query expansion and quality indicators in searching microblog posts, Proceedings of the 33rd European conference on Advances in information retrieval, April 18-21, 2011, Dublin, Ireland

Bernard J. Jansen , Kate Sobel , Geoff Cook, Being networked and being engaged: the impact of social networking on ecommerce information behavior, Proceedings of the 2011 iConference, p.130-136, February 08-11, 2011, Seattle, Washington

Dmitry Davidov , Oren Tsur , Ari Rappoport, Enhanced sentiment learning using Twitter hashtags and smileys, Proceedings of the 23rd International Conference on Computational Linguistics: Posters, p.241-249, August 23-27, 2010, Beijing, China

Claudia Wagner , Markus Strohmaier, The wisdom in tweetonomies: acquiring latent conceptual structures from social awareness streams, Proceedings of the 3rd International Semantic Search Workshop, p.1-10, April 26-26, 2010, Raleigh, North Carolina

Matthew Michelson , Sofus A. Macskassy, Discovering users' topics of interest on twitter: a first look, Proceedings of the fourth workshop on Analytics for noisy unstructured text data, October 26-26, 2010, Toronto, ON, Canada

Surender Reddy Yerva , Zoltán Miklós , Karl Aberer, What have fruits to do with technology?: the case of Orange, Blackberry and Apple, Proceedings of the International Conference on Web Intelligence, Mining and Semantics, May 25-27, 2011, Sogndal, Norway

Jingtao Wang , Shumin Zhai , John Canny, SHRIMP: solving collision and out of vocabulary problems in mobile predictive input with motion gesture, Proceedings of the 28th international conference on Human factors in computing systems, April 10-15, 2010, Atlanta, Georgia, USA

Takeshi Sakaki , Makoto Okazaki , Yutaka Matsuo, Earthquake shakes Twitter users: real-time event detection by social sensors, Proceedings of the 19th international conference on World wide web, April 26-30, 2010, Raleigh, North Carolina, USA

Sheila Kinsella , Mengjiao Wang , John G. Breslin , Conor Hayes, Improving categorisation in social media using hyperlinks to structured data sources, Proceedings of the 8th extended semantic web conference on The semantic web: research and applications, May 29-June 02, 2011, Heraklion, Crete, Greece

Jennifer Golbeck , Justin M. Grimes , Anthony Rogers, Twitter use by the U.S. Congress, Journal of the American Society for Information Science and Technology, v.61 n.8, p.1612-1621, August 2010

Jun Huang , Mizuho Iwaihara, Realtime social sensing of support rate for microblogging, Proceedings of the 16th international conference on Database systems for advanced applications, April 22-25, 2011, Hong Kong, China

Makoto Okazaki , Yutaka Matsuo, Semantic twitter: analyzing tweets for real-time event notification, Proceedings of the 2008/2009 international conference on Social software: recent trends and developments in social software, p.63-74, March 03-04, 2008, Cork, Ireland

Thomas Heverin , Lisl Zach, Twitter for city police department information sharing, Proceedings of the 73rd ASIS&T Annual Meeting on Navigating Streams in an Information Ecosystem, October 22-27, 2010, Pittsburgh, Pennsylvania

Pedro Henrique Calais Guerra , Adriano Veloso , Wagner Meira, Jr. , Virgílio Almeida, From bias to opinion: a transfer-learning approach to real-time sentiment analysis, Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, August 21-24, 2011, San Diego, California, USA

Miles Efron , Megan Winget, Questions are content: a taxonomy of questions in a microblogging environment, Proceedings of the 73rd ASIS&T Annual Meeting on Navigating Streams in an Information Ecosystem, October 22-27, 2010, Pittsburgh, Pennsylvania

Anlei Dong , Ruiqiang Zhang , Pranam Kolari , Jing Bai , Fernando Diaz , Yi Chang , Zhaohui Zheng , Hongyuan Zha, Time is of the essence: improving recency ranking using Twitter data, Proceedings of the 19th international conference on World wide web, April 26-30, 2010, Raleigh, North Carolina, USA

Yegin Genc , Yasuaki Sakamoto , Jeffrey V. Nickerson, Discovering context: classifying tweets through a semantic transform based on wikipedia, Proceedings of the 6th international conference on Foundations of augmented cognition: directing the future of adaptive systems, July 09-14, 2011, Orlando, FL

Evangelos Kalampokis , Michael Hausenblas , Konstantinos Tarabanis, Combining social and government open data for participatory decision-making, Proceedings of the Third IFIP WG 8.5 international conference on Electronic participation, August 29-September 01, 2011, Delft, The Netherlands

Ismael Santana Silva , Janaína Gomide , Adriano Veloso , Wagner Meira, Jr. , Renato Ferreira, Effective sentiment stream analysis with self-augmenting training and demand-driven projection, Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, July 24-28, 2011, Beijing, China

Bernard J. Jansen , Zhe Liu , Courtney Weaver , Gerry Campbell , Matthew Gregg, Real time search on the web: Queries, topics, and economic value, Information Processing and Management: an International Journal, v.47 n.4, p.491-506, July, 2011

Marti Motoyama , Brendan Meeder , Kirill Levchenko , Geoffrey M. Voelker , Stefan Savage, Measuring online service availability using twitter, Proceedings of the 3rd conference on Online social networks, p.13-13, June 22-25, 2010, Boston, MA

Zi Chu , Steven Gianvecchio , Haining Wang , Sushil Jajodia, Who is tweeting on Twitter: human, bot, or cyborg?, Proceedings of the 26th Annual Computer Security Applications Conference, December 06-10, 2010, Austin, Texas

Mike Thelwall , Kevan Buckley , Georgios Paltoglou, Sentiment in Twitter events, Journal of the American Society for Information Science and Technology, v.62 n.2, p.406-418, February 2011

Laurens De Vocht , Selver Softic , Martin Ebner , Herbert Mühlburger, Semantically driven social data aggregation interfaces for Research 2.0, Proceedings of the 11th International Conference on Knowledge Management and Knowledge Technologies, September 07-09, 2011, Graz, Austria

Twitter + event detection

Extracting Events and Event Descriptions from Twitter
http://www.www2011india.com/proceeding/companion/p105.pdf

This paper describes methods for automatically detecting events involving known entities from Twitter and understanding both the events as well as the audience reaction to them. We show that NLP techniques can be used to extract events, their main actors and the audience reactions with encouraging results.


Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors
http://ymatsuo.com/papers/www2010.pdf


Streaming First Story Detection with application to Twitter
http://aclweb.org/anthology/N/N10/N10-1021.pdf

With the recent rise in popularity and size of social media, there is a growing need for systems that can extract useful information from this amount of data. We address the problem of detecting new events from a stream of Twitter posts. To make event detection feasible on web-scale corpora, we present an algorithm based on locality-sensitive hashing which is able to overcome the limitations of traditional approaches, while maintaining competitive results. In particular, a comparison with a state-of-the-art system on the first story detection task shows that we achieve over an order of magnitude speedup in processing time, while retaining comparable performance. Event detection experiments on a collection of 160 million Twitter posts show that celebrity deaths are the fastest spreading news on Twitter.
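A minimal stand-in for the locality-sensitive hashing idea in the abstract above (the real system is far more refined): hash each document with random hyperplanes, compare it only against earlier documents in the same bucket, and flag it as a "first story" when nothing similar is found.

```python
# Minimal LSH-based first-story detection sketch on toy documents.
import numpy as np
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "earthquake hits the city centre",
    "strong earthquake shakes city",
    "celebrity announces new album",
    "aftershocks follow the earthquake",
]
X = TfidfVectorizer().fit_transform(docs).toarray()

rng = np.random.default_rng(0)
planes = rng.normal(size=(8, X.shape[1]))          # 8 random hyperplanes
buckets = defaultdict(list)                         # bucket key -> doc ids

for i, vec in enumerate(X):
    signature = tuple((planes @ vec) > 0)           # LSH bucket key
    candidates = buckets[signature]
    sims = [float(vec @ X[j]) for j in candidates]  # TF-IDF rows are L2-normed
    novelty = 1.0 - (max(sims) if sims else 0.0)
    print(docs[i], "-> first story" if novelty > 0.5 else "-> follow-up")
    buckets[signature].append(i)
```

A production first-story detector would use several hash tables with shorter signatures to trade recall against the number of comparisons; a single table is used here only to keep the sketch short.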


Detecting Controversial Events from Twitter
http://www.marcopennacchiotti.com/pro/publications/CIKM_2010.pdf
Social media provides researchers with continuously updated information about developments of interest to large audiences. This paper addresses the task of identifying controversial events using Twitter as a starting point: we propose 3 models for this task and report encouraging initial results.


The following is excerpted from http://hashimomau.chips.jp/weblogs/?page_id=182

Title: Why We Twitter: Understanding Microblogging Usage and Communities
Source: Joint 9th WEBKDD and 1st SNA-KDD Workshop '07
Summary / features:
A paper analyzing Twitter (possibly one of the earliest?)
Worldwide distribution of Twitter users, growth in user numbers, and so on
Analysis of Twitter user behavior
Users gather information, post trivial everyday tweets, and communicate with other users
Community analysis is also performed using the HITS algorithm
Impressions: This paper is cited very often; if you are starting Twitter research, it is worth reading first!
Link: http://ebiquity.umbc.edu/paper/html/id/367/Why-We-Twitter-Understanding-Microblogging-Usage-and-Communities
Title: TwitterRank: Finding Topic-sensitive Influential Twitterers
Source: WSDM 2010
Summary / features:
Proposes a method for identifying influential users on Twitter
TwitterRank: realized by extending the PageRank algorithm; in the experiments it is rated better than other algorithms (see the sketch after this entry)
Shows that "homophily" exists among Twitter users and demonstrates that "follow" relationships carry similarity of interests (topics)
Impressions: A paper that shows, academically and statistically, that follow relationships carry topical similarity, reportedly the first technical paper to do so. Because it builds on PageRank it seems effective for detecting influential (famous) users; however, I could not tell which part is actually topic-sensitive.
Link: http://www.mysmu.edu/staff/jsweng/papers/TwitterRank_WSDM.pdf
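TwitterRank extends PageRank with per-topic transition probabilities and teleportation. As a rough approximation of that idea (not the paper's algorithm), NetworkX's personalized PageRank can bias the random walk toward users who are active in a given topic; the graph and topic scores below are invented placeholders.

```python
# Rough approximation of topic-sensitive ranking via personalized PageRank.
import networkx as nx

# follower -> followee edges (toy data)
G = nx.DiGraph([("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u3", "u1")])

# Hypothetical share of each user's tweets that are about the topic of interest.
topic_activity = {"u1": 0.1, "u2": 0.2, "u3": 0.7}

rank = nx.pagerank(G, alpha=0.85, personalization=topic_activity)
print(sorted(rank.items(), key=lambda kv: -kv[1]))
```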


Title: Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment
Source: Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media 2010
Summary / features:
A paper relating Twitter to elections (politics)
Investigates the relationship between the number of Twitter posts and the German federal election (predicting the election result from post counts)
Performs sentiment analysis with LIWC (a tool for analyzing the sentiment of posts) and discusses the results against the political situation at the time
Impressions: It is debatable whether election results can really be derived from post counts. Still, this may be the first paper to connect elections and tweets via natural language processing. Posts (in German) are translated into English and then run through LIWC; this works because German translates well into English, and would be hard to pull off for Japanese. In Japan, papers on the relationship between the Web and politics seem remarkably scarce.
Link: http://videolectures.net/icwsm2010_sprenger_pet/
Title: Tweet the Debates: Understanding Community Annotation of Uncollected Sources
Source: WSM '09
Summary / features:
A study analyzing the flow of discussion in the US presidential debates alongside the flow of discussion on Twitter during the same period
Attempts to annotate the debate video using Twitter posts
The debates were broadcast live on TV, and the closed captions of the broadcast are used as the flow of the discussion
The Twitter posts used include tweets carrying the relevant hashtags
Impressions: The work on annotating video from posts was interesting.
Link: http://videolectures.net/icwsm2010_sprenger_pet/
Title: Microblogging During Two Natural Hazards Events: What Twitter May Contribute to Situational Awareness
Source: CHI 2010: Crisis Informatics
Summary / features:
A study of how information propagates during disasters, exploiting the characteristics of microblogging
Uses data from actual disasters
The Oklahoma Grassfires of April 2009 and the Red River Floods that occurred in March and April 2009
Impressions: A paper showing that the real-time nature of microblogging is valuable during disasters. I do not remember the specific methods, but that was the conclusion. It convinced me that the characteristics of microblogs can be useful to society.
Link: http://delivery.acm.org/10.1145/1760000/1753486/p1079-vieweg.pdf?key1=1753486&key2=5717842921&coll=DL&dl=ACM&CFID=2537635&CFTOKEN=65603928
Title: Twitter mood predicts the stock market.
Source: Apparently to be submitted somewhere (as of 2010)
Summary / features:
Predicts up/down movements of the Dow Jones Industrial Average (DJIA) from Twitter posts with 87.6% accuracy
Analysis target:
We only take into account tweets that contain explicit statements of their author's mood states.
"I feel", "i am feeling", "i'm feeling", "i don't feel", "I'm", "Im", "I am", and "makes me".
The selected tweets are run through OpinionFinder (OF), a positive/negative classifier, and Google-Profile of Mood States (GPOMS), a mood analyzer, and the resulting values from the two are reported
GPOMS classifies posts into Calm, Alert, Sure, Vital, Kind, and Happy
Both sets of analysis results are then fed into two market-prediction models:
Bivariate Granger Causality Analysis and a Self-organizing Fuzzy Neural Network (SOFNN) model
By focusing on the Calm dimension in the SOFNN model, the 87.6% accuracy is achieved

Friday, January 6, 2012

Twitter + healthcare

Towards detecting influenza epidemics by analyzing Twitter messages
http://snap.stanford.edu/soma2010/papers/soma2010_16.pdf

Rapid response to a health epidemic is critical to reduce loss of life. Existing methods mostly rely on expensive surveys of hospitals across the country, typically with lag times of one to two weeks for influenza reporting, and even longer for less common diseases. In response, there have been several recently proposed solutions to estimate a population's health from Internet activity, most notably Google's Flu Trends service, which correlates search term frequency with influenza statistics reported by the Centers for Disease Control and Prevention (CDC). In this paper, we analyze messages posted on the micro-blogging site Twitter.com to determine if a similar correlation can be uncovered. We propose several methods to identify influenza-related messages and compare a number of regression models to correlate these messages with CDC statistics. Using over 500,000 messages spanning 10 weeks, we find that our best model achieves a correlation of .78 with CDC statistics by leveraging a document classifier to identify relevant messages.
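The correlation step described above can be sketched very simply: regress weekly CDC ILI rates on weekly counts of influenza-related tweets and report the correlation of the fit. The numbers below are made-up placeholders, not data from the paper.

```python
# Hedged sketch: regress CDC ILI rates on weekly flu-tweet counts.
import numpy as np

weekly_flu_tweets = np.array([120, 150, 200, 340, 500, 620, 580, 430, 300, 210])
cdc_ili_rate = np.array([1.1, 1.3, 1.8, 2.9, 4.2, 5.0, 4.7, 3.6, 2.5, 1.9])

slope, intercept = np.polyfit(weekly_flu_tweets, cdc_ili_rate, deg=1)
predicted = slope * weekly_flu_tweets + intercept
corr = np.corrcoef(predicted, cdc_ili_rate)[0, 1]
print(f"fit: ili = {slope:.4f} * tweets + {intercept:.2f}, correlation = {corr:.2f}")
```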



Twitter Catches The Flu: Detecting Influenza Epidemics using Twitter
http://www.aclweb.org/anthology/D/D11/D11-1145.pdf



With the recent rise in popularity and scale of social media, a growing need exists for systems that can extract useful information from huge amounts of data. We address the issue of detecting influenza epidemics. First, the proposed system extracts influenza-related tweets using the Twitter API. Then, only tweets that mention actual influenza patients are extracted by the support vector machine (SVM) based classifier. The experiment results demonstrate the feasibility of the proposed approach (0.89 correlation to the gold standard). Especially at the outbreak and early spread (early epidemic stage), the proposed method shows high correlation (0.97 correlation), which outperforms the state-of-the-art methods. This paper describes that Twitter texts reflect the real world, and that NLP techniques can be applied to extract only tweets that contain useful information.

TWITTER IMPROVES SEASONAL INFLUENZA PREDICTION
http://www.cs.uml.edu/~hachreka/SNEFT/images/healthinf_2012.pdf

Data Set To be Potentially Purchased

MySpace User Activity Stream: User count by lat/long
http://www.infochimps.com/datasets/myspace-user-activity-stream-user-count-by-latlong
The data is derived from the MySpace real-time stream API. It contains all of the users in the dataset, around 11 million, with well-formed latitude and longitude.


ClueWeb 2009
http://lemurproject.org/clueweb09.php/