
Saturday, February 11, 2012

Wednesday, January 11, 2012

NIMBLE: A Toolkit for the Implementation of Parallel Data Mining and Machine Learning Algorithms on MapReduce

http://users.cis.fiu.edu/~lzhen001/activities/KDD2011Program/docs/p334.pdf


In the last decade, advances in data collection and storage technologies have led to an increased interest in designing and implementing large-scale parallel algorithms for machine learning and data mining (ML-DM). Existing programming paradigms for expressing large-scale parallelism such as MapReduce (MR) and the Message Passing Interface (MPI) have been the de facto choices for implementing these ML-DM algorithms. The MR programming paradigm has been of particular interest as it gracefully handles large datasets and has built-in resilience against failures. However, the existing parallel programming paradigms are too low-level and ill-suited for implementing ML-DM algorithms. To address this deficiency, we present NIMBLE, a portable infrastructure that has been specifically designed to enable the rapid implementation of parallel ML-DM algorithms. The infrastructure allows one to compose parallel ML-DM algorithms using reusable (serial and parallel) building blocks that can be efficiently executed using MR and other parallel programming models; it currently runs on top of Hadoop, which is an open-source MR implementation. We show how NIMBLE can be used to realize scalable implementations of ML-DM algorithms and present a performance evaluation.
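Rough idea in code: a toy sketch (not NIMBLE itself, and with no Hadoop dependency) of what "composing an algorithm from reusable map/reduce building blocks" means; the function names and the local runner are purely illustrative.

```python
# A minimal sketch: composing a word-count task from reusable map/reduce
# building blocks, executed here by a toy serial runner standing in for Hadoop.
from collections import defaultdict

def map_tokens(record):
    # serial building block: emit (token, 1) pairs for one input record
    for token in record.split():
        yield token, 1

def reduce_sum(key, values):
    # serial building block: aggregate all values seen for a key
    return key, sum(values)

def run_mapreduce(records, mapper, reducer):
    # toy local executor; a framework like NIMBLE would schedule the same
    # building blocks as MapReduce jobs on a cluster instead
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(k, vs) for k, vs in groups.items())

print(run_mapreduce(["a b a", "b c"], map_tokens, reduce_sum))
```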

2012年1月7日土曜日

Wikipedia SOM (Self Organizing Map)


https://kaigi.org/jsai/webprogram/2011/pdf/329.pdf


Wikipedia, the online encyclopedia that anyone can edit through a web browser, has a semi-structured data format and covers a wide range of fields with high comprehensiveness, so it has been used as a corpus in research areas such as artificial intelligence, natural language processing, and Web mining. The amount of information published on Wikipedia grows daily; across all languages there are more than 18 million articles. As a result, it has become difficult to grasp the overall picture of Wikipedia, such as how much information exists in which fields and how those fields connect to one another. Research that uses Wikipedia as a corpus often designs algorithms according to the characteristics of the data, so a bird's-eye view of the whole is important: what kinds of article sets exist and in what quantity, what clusters there are, and how the clusters relate to each other. Such an overview is also considered important for ordinary users who browse or edit Wikipedia, for purposes such as identifying missing information and examining relationships between fields.
In this work, we apply MIGSOM [Nakayama 11], a self-organizing map algorithm inspired by neuronal migration, to Wikipedia and propose a method for surveying its overall structure. MIGSOM is a technique for visualizing large-scale sparse data and producing document maps. It has two notable features. First, stable clustering performance can be expected even when it is applied to large-scale data. Second, it supports cluster analysis with a zoom function, which enables analysis of both global and local clusters.
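For reference, a minimal classical self-organizing map update in NumPy; this is the textbook SOM, not MIGSOM, and the map size, decay schedule, and random document vectors are placeholders.

```python
# A minimal sketch of a classical SOM update (not the MIGSOM algorithm above).
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 50          # map size and input dimensionality
weights = rng.random((grid_h, grid_w, dim))
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

def train_step(x, t, n_steps, lr0=0.5, sigma0=3.0):
    """One SOM update for input vector x at iteration t."""
    lr = lr0 * (1 - t / n_steps)                 # decaying learning rate
    sigma = sigma0 * (1 - t / n_steps) + 1e-3    # decaying neighborhood radius
    # best matching unit (BMU): the map cell whose weight vector is closest to x
    bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(-1)), (grid_h, grid_w))
    # Gaussian neighborhood around the BMU on the grid
    d2 = ((coords - np.array(bmu)) ** 2).sum(-1)
    h = np.exp(-d2 / (2 * sigma ** 2))[..., None]
    weights[:] = weights + lr * h * (x - weights)

docs = rng.random((200, dim))                    # stand-in for sparse document vectors
for t in range(1000):
    train_step(docs[t % len(docs)], t, 1000)
```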

Graph + Bioinformatics

Causal graph-based analysis of genome-wide association data in rheumatoid arthritis
http://www.biology-direct.com/content/6/1/25

Graph-based methods for analysing networks in cell biology
http://bib.oxfordjournals.org/content/7/3/243.full
Availability of large-scale experimental data for cell biology is enabling computational methods to systematically model the behaviour of cellular networks. This review surveys the recent advances in the field of graph-driven methods for analysing complex cellular networks. The methods are outlined on three levels of increasing complexity, ranging from methods that can characterize global or local structural properties of networks to methods that can detect groups of interconnected nodes, called motifs or clusters, potentially involved in common elementary biological functions. We also briefly summarize recent approaches to data integration and network inference through graph-based formalisms. Finally, we highlight some challenges in the field and offer our personal view of the key future trends and developments in graph-based analysis of large-scale datasets.

Graph-based clustering and characterization of repetitive sequences in next-generation sequencing data
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2912890/

The investigation of plant genome structure and evolution requires comprehensive characterization of repetitive sequences that make up the majority of higher plant nuclear DNA. Since genome-wide characterization of repetitive elements is complicated by their high abundance and diversity, novel approaches based on massively-parallel sequencing are being adapted to facilitate the analysis. It has recently been demonstrated that the low-pass genome sequencing provided by a single 454 sequencing reaction is sufficient to capture information about all major repeat families, thus providing the opportunity for efficient repeat investigation in a wide range of species. However, the development of appropriate data mining tools is required in order to fully utilize this sequencing data for repeat characterization.
Results
We adapted a graph-based approach for similarity-based partitioning of whole genome 454 sequence reads in order to build clusters made of the reads derived from individual repeat families. The information about cluster sizes was utilized for assessing the proportion and composition of repeats in the genomes of two model species, Pisum sativum and Glycine max, differing in genome size and 454 sequencing coverage. Moreover, statistical analysis and visual inspection of the topology of the cluster graphs using a newly developed program tool, SeqGrapheR, were shown to be helpful in distinguishing basic types of repeats and investigating sequence variability within repeat families.
Conclusions
Repetitive regions of plant genomes can be efficiently characterized by the presented graph-based analysis and the graph representation of repeats can be further used to assess the variability and evolutionary divergence of repeat families, discover and characterize novel elements, and aid in subsequent assembly of their consensus sequences.
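A rough sketch of the general workflow described above (reads become nodes, similarity edges connect them, clusters fall out as connected components); the k-mer Jaccard similarity and threshold are my own stand-ins, not the paper's pipeline or SeqGrapheR.

```python
# A minimal sketch of similarity-based read clustering with NetworkX.
import networkx as nx

def kmer_set(seq, k=8):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=8):
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    return len(sa & sb) / max(1, len(sa | sb))   # Jaccard over shared k-mers

def cluster_reads(reads, threshold=0.2):
    g = nx.Graph()
    g.add_nodes_from(range(len(reads)))
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            if similarity(reads[i], reads[j]) >= threshold:
                g.add_edge(i, j)
    # cluster sizes approximate the abundance of each repeat family
    return sorted(nx.connected_components(g), key=len, reverse=True)
```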

Graph-based data mining for biological applications
https://lirias.kuleuven.be/bitstream/123456789/267094/1/phd_leander_schietgat_final_version.pdf

Graph + Text Analysis

Plagiarism Detection Using Graph-Based Representation
http://www.mendeley.com/research/plagiarism-detection-using-graphbased-representation/

Plagiarism of material from the Internet is a widespread and growing problem. Several methods are used to detect plagiarism and similarity between the source document and suspected documents, such as fingerprinting based on characters or n-grams. In this paper, we discuss a new method to detect plagiarism based on a graph representation. Preprocessing is required for each document, such as breaking the document down into its constituent sentences, segmenting each sentence into separate terms, and removing stop words. We build the graph by grouping each sentence's terms into one node; the resulting nodes are connected to each other based on the order of the sentences within the document, and all nodes in the graph are also connected to a top-level node, the "topic signature". The topic signature node is formed by extracting the concepts of each sentence's terms and grouping them in that node. The main advantage of the proposed method is that the topic signature, the main entry point to the graph, serves as a quick guide to the relevant nodes that should be considered for the comparison between the source documents and the suspected one. We believe the proposed method can achieve good performance in terms of effectiveness and efficiency.
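A small sketch of the graph structure the abstract describes (sentence-term nodes, sentence-order edges, and a shared topic-signature node); the tokenization and stop-word list are placeholders, not the authors' code.

```python
# A minimal sketch of the document graph representation described above.
import networkx as nx

STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def document_graph(text):
    g = nx.Graph()
    g.add_node("topic_signature", terms=set())
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    for i, sent in enumerate(sentences):
        terms = {w.lower() for w in sent.split() if w.lower() not in STOPWORDS}
        g.add_node(i, terms=terms)
        g.add_edge(i, "topic_signature")          # every sentence links to the topic signature
        if i > 0:
            g.add_edge(i - 1, i)                  # sentence-order link
        g.nodes["topic_signature"]["terms"] |= terms
    return g
```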



Possibilities of the Bipartite Graph Clustering in Text Analysis (text similarity and sentence-processing models)

http://ci.nii.ac.jp/naid/110004809727
When applying Markov clustering (MCL), and our own improvement of it, recurrent Markov clustering (RMCL), to text analysis, we propose building a bipartite graph from keyword/co-occurring-word pairs as an effective way of obtaining data. We compare the results of MCL-RMCL bipartite-graph clustering with conventional multivariate analysis based on the vector space model and verify its effectiveness.
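For context, a minimal Markov clustering (MCL) loop in NumPy; the RMCL refinement mentioned above is not reproduced, and the parameters are generic defaults rather than the paper's settings.

```python
# A minimal sketch of Markov clustering (MCL) on an adjacency matrix.
import numpy as np

def mcl(adj, expansion=2, inflation=2.0, iters=50):
    m = adj + np.eye(len(adj))                    # add self-loops
    m = m / m.sum(axis=0, keepdims=True)          # column-normalize: stochastic matrix
    for _ in range(iters):
        m = np.linalg.matrix_power(m, expansion)  # expansion: spread random-walk flow
        m = m ** inflation                        # inflation: sharpen strong flows
        m = m / m.sum(axis=0, keepdims=True)
    # read off clusters: each column is attracted to the row holding its mass
    clusters = {}
    for col in range(m.shape[1]):
        attractor = int(np.argmax(m[:, col]))
        clusters.setdefault(attractor, []).append(col)
    return list(clusters.values())
```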

Graph + CyberSecurity (2)

EigenDiagnostics: Spotting Connection Patterns and Outliers in Large Graphs
http://www.computer.org/portal/web/csdl/doi/10.1109/ICDMW.2010.203

In a large weighted graph, how can we detect suspicious subgraphs, patterns, and outliers? A suspicious pattern could be a near-clique or a set of nodes bridging two or more near-cliques. This would improve intrusion detection in computer networks and network traffic monitoring. Are there other network patterns that need to be detected? We propose EigenDiagnostics, a fast algorithm that spots such patterns. The process creates scatter-plots of the node properties (such as eigenscores, degree, and weighted degree), then looks for linear-like patterns. Our tool automatically discovers such plots, using the Hough transform from machine vision. We apply EigenDiagnostics on a wide variety of synthetic and real data (LBNL computer traffic, movie-actor data from IMDB, patent citations, and more). EigenDiagnostics finds surprising patterns. They appear to correspond to port-scanning (in computer networks), repetitive tasks with bot-net-like behavior, strange "bridges" in movie-actor data (due to actors changing careers, for example), and more. The advantages are: (a) it is effective in discovering surprising patterns, (b) it is fast (linear in the number of edges), (c) it is parameter-free, and (d) it is general, and applicable to many, diverse graphs, spanning tens of gigabytes.
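A small sketch of the node properties being scatter-plotted (eigenscore vs. degree) using NetworkX; the Hough-transform pattern search that EigenDiagnostics adds on top of these plots is omitted here.

```python
# A minimal sketch: compute (eigenscore, degree) pairs per node for scatter-plotting.
import networkx as nx
import numpy as np

def eigen_vs_degree(g):
    centrality = nx.eigenvector_centrality_numpy(g)     # principal-eigenvector score
    return np.array([[centrality[n], g.degree(n)] for n in g.nodes()])

g = nx.barabasi_albert_graph(500, 3, seed=1)            # synthetic stand-in graph
points = eigen_vs_degree(g)                             # each row: (eigenscore, degree)
```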


Mining and Modeling Real Graphs: Patterns, Generators, Anomalies, and Tools
http://www.cs.cmu.edu/~lakoglu/proposal/lakoglu-proposal.pdf




Community-based anomaly detection in evolutionary networks
http://www.springerlink.com/content/b61165511117u863/
Networks of dynamic systems, including social networks, the World Wide Web, climate networks, and biological networks, can be highly clustered. Detecting clusters, or communities, in such dynamic networks is an emerging area of research; however, less work has been done in terms of detecting community-based anomalies. While there has been some previous work on detecting anomalies in graph-based data, none of these anomaly detection approaches have considered an important property of evolutionary networks—their community structure. In this work, we present an approach to uncover community-based anomalies in evolutionary networks characterized by overlapping communities. We develop a parameter-free and scalable algorithm using a proposed representative-based technique to detect all six possible types of community-based anomalies: grown, shrunken, merged, split, born, and vanished communities. We detail the underlying theory required to guarantee the correctness of the algorithm. We measure the performance of the community-based anomaly detection algorithm by comparison to a non–representative-based algorithm on synthetic networks, and our experiments on synthetic datasets show that our algorithm achieves a runtime speedup of 11–46 over the baseline algorithm. We have also applied our algorithm to two real-world evolutionary networks, Food Web and Enron Email. Significant and informative community-based anomaly dynamics have been detected in both cases.
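A rough sketch of how two snapshots of an evolving network might be compared; the Jaccard matching and thresholds are my own simplification covering only four of the six anomaly types, not the paper's representative-based algorithm.

```python
# A minimal sketch: flag born / vanished / grown / shrunken communities between
# two snapshots, where each community is a set of node ids.
def jaccard(a, b):
    return len(a & b) / len(a | b)

def community_changes(prev, curr, match=0.3, grow=1.5):
    events = []
    for c in curr:
        best = max(prev, key=lambda p: jaccard(p, c), default=set())
        if not best or jaccard(best, c) < match:
            events.append(("born", c))
        elif len(c) >= grow * len(best):
            events.append(("grown", c))
        elif len(best) >= grow * len(c):
            events.append(("shrunken", c))
    for p in prev:
        if all(jaccard(p, c) < match for c in curr):
            events.append(("vanished", p))
    return events

print(community_changes([{1, 2, 3}, {7, 8}], [{1, 2, 3, 4, 5, 6}, {9, 10}]))
```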



Using Bayesian Networks for Cyber Security Analysis
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5544924
Capturing the uncertain aspects in cyber security is important for security analysis in enterprise networks. However, there has been insufficient effort in studying what modeling approaches correctly capture such uncertainty, and how to construct the models to make them useful in practice. In this paper, we present our work on justifying uncertainty modeling for cyber security, and initial evidence indicating that it is a useful approach. Our work is centered around near real-time security analysis such as intrusion response. We need to know what is really happening, the scope and severity level, possible consequences, and potential countermeasures. We report our current efforts on identifying the important types of uncertainty and on using Bayesian networks to capture them for enhanced security analysis. We build an example Bayesian network based on a current security graph model, justify our modeling approach through attack semantics and experimental study, and show that the resulting Bayesian network is not sensitive to parameter perturbation.


Applying Graph-Based Anomaly Detection Approaches to the Discovery of Insider Threats
http://eecs.wsu.edu/~holder/pubs/EberleISI09.pdf

The ability to mine data represented as a graph has become important in several domains for detecting various structural patterns. One important area of data mining is anomaly detection, but little work has been done in terms of detecting anomalies in graph-based data. In this paper we present graph-based approaches to uncovering anomalies in applications containing information representing possible insider threat activity: e-mail, cell-phone calls, and order processing.




Graph-based malware detection using dynamic analysis
http://www.mendeley.com/research/graphbased-malware-detection-using-dynamic-analysis/

Visualizing graph dynamics and similarity for enterprise network security and management
Managing complex enterprise networks requires an understanding at a finer granularity than traditional network monitoring. The ability to correlate and visualize the dynamics and inter-relationships among various network components such as hosts, users, and applications is non-trivial. In this paper, we propose a visualization approach based on the hierarchical structure of similarity/difference visualization in the context of heterogeneous graphs. The concept of hierarchical visualization starts with the evolution of inter-graph states, adapts to the visualization of intra-graph clustering, and concludes with the visualization of similarity between individual nodes. Our visualization tool, ENAVis (Enterprise Network Activities Visualization), quantifies and presents these important changes and dynamics essential to network operators through a visually appealing and highly interactive manner. Through novel graph construction and transformation, such as network connectivity graphs, MDS graphs, bipartite graphs, and similarity graphs, we demonstrate how similarity/dynamics can be effectively visualized to provide insight with regards to network understanding.



A Graph Similarity-based Approach to Security Event Analysis Using Correlation Techniques

http://dl.acm.org/citation.cfm?id=1850799
Detecting and identifying security events to provide cyber situation awareness has become an increasingly important task within the network research and development community. We propose a graph similarity-based approach to event detection and identification that integrates a number of techniques to collect time-varying situation information, extract correlations between event attributes, and characterize and identify security events. Diverging from the traditional rule- or statistical-based pattern matching techniques, the proposed mechanism represents security events in a graphical form of correlation networks and identifies security events through the computation of graph similarity measurements to eliminate the need for constructing user or system profiles. These technical components take fundamentally different approaches from traditional empirical or statistical methods and are designed based on rigorous computational analysis with mathematically proven performance guarantees. The performance superiority of the proposed mechanism is demonstrated by extensive simulation and experimental results.

Graph Based Statistical Analysis of Network Traffic
http://www.cs.purdue.edu/mlg2011/papers/paper_10.pdf

We propose a method for analyzing traffic data in large computer networks such as big enterprise networks or the Internet. Our approach combines graph theoretical representation of the data and graph analysis with novel statistical methods for discovering pattern and time-related anomalies. We model the traffic as a graph and use temporal characteristics of the data in order to decompose it into subgraphs corresponding to individual sessions, whose characteristics are then analyzed using statistical methods. The goal of that analysis is to discover patterns in the network traffic data that might indicate intrusion activity or other malicious behavior.
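A small sketch of the decomposition idea (flows split into sessions by a time gap, each session kept as its own subgraph); the flow record format and the per-source timeout are assumptions, not the paper's method.

```python
# A minimal sketch: split a flow stream into per-source sessions and build a
# small directed subgraph for each session.
import networkx as nx
from collections import defaultdict

def session_subgraphs(flows, timeout=60.0):
    """flows: iterable of (timestamp, src, dst), sorted by timestamp.
    Flows from one source separated by gaps below `timeout` form one session."""
    sessions, last_seen, current = [], {}, defaultdict(list)
    for ts, src, dst in flows:
        if src in last_seen and ts - last_seen[src] > timeout:
            sessions.append(current.pop(src))      # close the source's old session
        last_seen[src] = ts
        current[src].append((ts, src, dst))
    sessions.extend(current.values())
    graphs = []
    for sess in sessions:
        g = nx.DiGraph()
        for ts, src, dst in sess:
            g.add_edge(src, dst, ts=ts)            # session subgraph, ready for statistics
        graphs.append(g)
    return graphs
```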


Using Graph Theory to Detect Security Policy Violators
http://www.mawode.com/~waltman/talks/plug05emailgraph.pdf

Understanding Multistage Attacks by Attack-Track based Visualization of Heterogeneous Event Streams
http://www.cse.buffalo.edu/~shambhu/documents/pdf/vizsec01s-mathew.pdf

In this paper, we present a method of handling the visualization of heterogeneous event traffic that is generated by intrusion detection sensors, log files and other event sources on a computer network, from the point of view of detecting multistage attack paths that are of importance. We perform aggregation and correlation of these events based on their semantic content to generate Attack Tracks that are displayed to the analyst in real-time. Our tool, called the Event Correlation for Cyber-Attack Recognition System (ECCARS), enables the analyst to distinguish and separate an evolving multistage attack from the thousands of events generated on a network. We focus here on presenting the environment and framework for multistage attack detection using ECCARS along with screenshots that demonstrate its capabilities.

Scenario Graphs Applied to Network Security
http://www.cs.cmu.edu/~scenariograph/wing07.pdf

Traditional model checking produces one counterexample to illustrate a violation of a property by a model of the system. Some applications benefit from having all counterexamples, not just one. We call this set of counterexamples a scenario graph. In this chapter we present two different algorithms for producing scenario graphs and explain how scenario graphs are a natural representation for attack graphs used in the security community. Through a detailed concrete example, we show how we can model a computer network and generate and analyze attack graphs automatically. The attack graph we produce for a network model shows all ways in which an intruder can violate a given desired security property.

Network Security Evaluation through Attack Graph Generation
http://www.waset.org/journals/waset/v54/v54-73.pdf
In today's networks, security evaluation is a challenging task for most administrators. The typical means by which an attacker breaks into a network is through a series of exploits, where each exploit in the series satisfies the precondition for subsequent exploits and makes a causal relationship among them. Such a series of exploits constitutes an attack path, and the set of all possible attack paths forms an attack graph. Even well-administered networks are susceptible to such attacks, as present-day vulnerability scanners are only able to identify vulnerabilities in isolation; there is a need for logical formalism and correlation among these vulnerabilities, within a host or across multiple hosts, to identify the overall risk of the network. In this paper we propose a network security analysis method based on the generation of a network attack graph. After analyzing network vulnerabilities, the linking relations between devices, and the characteristics of attacks, the model of network security states is built and the generating algorithm of the attack graph is implemented. Attack graphs are important tools for analyzing security vulnerabilities in enterprise networks. The experiment validates the evaluation method we proposed.
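A toy sketch of attack-graph generation as a forward search over security states; the exploits and their pre/postconditions below are invented for illustration, not taken from the paper.

```python
# A minimal sketch: each exploit fires when its preconditions hold in the current
# state and adds its postcondition; every transition becomes an attack-graph edge.
import networkx as nx

EXPLOITS = [
    # (name, preconditions, postcondition) -- hypothetical examples
    ("phishing",   {"reachable(web)"},              "user(web)"),
    ("local_priv", {"user(web)"},                   "root(web)"),
    ("trust_hop",  {"root(web)", "reachable(db)"},  "user(db)"),
    ("db_exploit", {"user(db)"},                    "root(db)"),
]

def attack_graph(initial):
    g = nx.DiGraph()
    frontier = [frozenset(initial)]
    seen = set(frontier)
    while frontier:
        state = frontier.pop()
        for name, pre, post in EXPLOITS:
            if pre <= state and post not in state:
                nxt = frozenset(state | {post})
                g.add_edge(state, nxt, exploit=name)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return g

g = attack_graph({"reachable(web)", "reachable(db)"})
print(g.number_of_nodes(), "states,", g.number_of_edges(), "attack steps")
```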

Graph + CyberSecurity

Making Cyber Security Decisions Through a Quantitative Metrics Approach
http://eden.dei.uc.pt/~mvieira/raci2011/Sanders_RACI_keynote.pdf


Least Effort Strategies for Cybersecurity
http://arxiv.org/pdf/cond-mat/0306002

Cybersecurity is an issue of increasing concern since the events of September 11. Many questions have been raised concerning the security of the Internet and the rest of the US's information infrastructure. This paper begins to examine the issue by analyzing the Internet's autonomous system (AS) map. Using the AS map, malicious infections are simulated and different defense strategies are considered in a cost-benefit framework. The results show that protecting the most connected nodes provides significant gains in security, and that after the small minority of the most connected nodes are protected there are diminishing returns for further protection. However, if parts of that small minority of the most connected firms are not protected, such as non-US firms, protection levels are significantly decreased.
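A small simulation sketch of the cost-benefit idea (protect the top-k highest-degree nodes, then see how far an infection spreads); the Barabási-Albert graph stands in for the AS map and the spreading model is deliberately crude.

```python
# A minimal sketch: infection spreads freely through unprotected neighbors;
# protecting only a few hubs already cuts the reachable fraction sharply.
import networkx as nx
import random

def infected_fraction(g, protected, seeds=5, rng=random.Random(0)):
    safe = set(protected)
    frontier = [n for n in rng.sample(list(g.nodes()), seeds) if n not in safe]
    infected = set(frontier)
    while frontier:
        n = frontier.pop()
        for nb in g.neighbors(n):
            if nb not in infected and nb not in safe:
                infected.add(nb)
                frontier.append(nb)
    return len(infected) / g.number_of_nodes()

g = nx.barabasi_albert_graph(2000, 2, seed=1)      # scale-free stand-in for the AS map
by_degree = sorted(g.nodes(), key=g.degree, reverse=True)
for k in (0, 10, 50, 200):
    print(k, "protected ->", round(infected_fraction(g, by_degree[:k]), 3))
```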

Graph-based Data Mining

Graph-based data mining
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=850825
Using databases represented as graphs, the Subdue system performs two key data mining techniques: unsupervised pattern discovery and supervised concept learning from examples. Applications to large structural databases demonstrate Subdue's scalability and effectiveness.

CyberSecurity + Graph

Cyber Security Risks Assessment with Bayesian Defense Graphs and Architectural Models
http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4755419

To facilitate rational decision making regarding cyber security investments, decision makers need to be able to assess expected losses before and after potential investments. This paper presents a model-based assessment framework for analyzing the cyber security provided by different architectural scenarios. The framework uses Bayesian-statistics-based extended influence diagrams to express attack graphs and related countermeasures. In this paper it is demonstrated how this structure can be captured in an abstract model to support analysis based on architectural models. The approach allows calculating the probability that attacks will succeed and the expected loss of these given the instantiated architectural scenario. Moreover, the framework can handle the uncertainties that accompany the analyses. In architectural analysis there are uncertainties associated both with the scenario and its properties, as well as with the analysis framework that stipulates how security countermeasures contribute to cyber security.

Centrifuge 2.0 - Cyber Security Analysis - Identify Insights with Relationship Graphs
http://www.youtube.com/watch?v=8kJk0mBh8sg
http://www.youtube.com/watch?v=eZqxRGBwIO0&feature=related


Attack Graphs for Proactive Digital Forensics



A Mathematical Basis for Science-Based Cybersecurity

Mining Graph Patterns Efficiently via Randomized Summaries
http://www.cs.uiuc.edu/~hanj/pdf/vldb09_cchen.pdf

Graphs are prevalent in many domains such as bioinformatics, social networks, the Web, and cyber-security. Graph pattern mining has become an important tool in the management and analysis of complexly structured data, where example applications include indexing, clustering, and classification. Existing graph mining algorithms have achieved great success by exploiting various properties in the pattern space. Unfortunately, due to the fundamental role subgraph isomorphism plays in these methods, they may all fall into a pitfall when the cost to enumerate a huge set of isomorphic embeddings blows up, especially in large graphs. The solution we propose for this problem resorts to reduction on the data space. For each graph, we build a summary of it and mine this shrunk graph instead. Compared to other data reduction techniques that either reduce the number of transactions or compress between transactions, this new framework, called Summarize-Mine, suggests a third path by compressing within transactions. Summarize-Mine is effective in cutting down the size of graphs, thus decreasing the embedding enumeration cost. However, compression might lose patterns at the same time. We address this issue by generating randomized summaries and repeating the process for multiple rounds, where the main idea is that true patterns are unlikely to be missed in all rounds. We provide strict probabilistic guarantees on the pattern loss likelihood. Experiments on real malware trace data show that Summarize-Mine is very efficient and can find interesting malware fingerprints that were not revealed previously.
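A rough sketch of the randomized-summary step only (randomly merging same-label nodes into fewer super-nodes); the mining itself and the probabilistic bookkeeping across rounds are not shown, and none of this is the authors' code.

```python
# A minimal sketch: shrink a labeled graph by randomly grouping nodes that share
# a label; repeating with fresh random groupings gives the multiple rounds that
# make it unlikely a true pattern is lost every time.
import networkx as nx
import random
from collections import defaultdict

def summarize(g, ratio=0.5, rng=random.Random(0)):
    by_label = defaultdict(list)
    for node, data in g.nodes(data=True):
        by_label[data["label"]].append(node)
    mapping = {}
    for label, nodes in by_label.items():
        k = max(1, int(len(nodes) * ratio))        # super-nodes kept for this label
        for node in nodes:
            mapping[node] = (label, rng.randrange(k))
    s = nx.Graph()
    s.add_nodes_from(set(mapping.values()))
    for u, v in g.edges():
        if mapping[u] != mapping[v]:
            s.add_edge(mapping[u], mapping[v])
    return s

rng = random.Random(0)
g = nx.gnm_random_graph(200, 600, seed=1)
for n in g.nodes():
    g.nodes[n]["label"] = rng.choice("ABC")
print(g.number_of_nodes(), "->", summarize(g).number_of_nodes(), "nodes after summarization")
```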


Cyber Security Link Analysis Graph
http://www.analyticbridge.com/photo/cyber-security-link-analysis
This visualization shows interesting patterns of behavior for recent network login traffic. The linkages are between source and destination IPs. The circular stars show one-to-one relationships representing normal behavior. But the unusual pattern in the lower center shows a destination IP under attack -- it has over 100 source IPs sending it traffic.


Oil Companies Stepping up Cyber Security as Hacking Attacks Increase
http://oilprice.com/Energy/Energy-General/Oil-Companies-Stepping-up-Cyber-Security-as-Hacking-Attacks-Increase.html


An Attack Graph Based Approach for Threat Identification of an Enterprise Network
http://www.igi-global.com/chapter/cyber-security-global-information-assurance/7409

The science of cyber security experimentation: the DETER project
http://dl.acm.org/citation.cfm?id=2076752
Since 2004, the DETER Cyber-security Project has worked to create an evolving infrastructure - facilities, tools, and processes - to provide a national resource for experimentation in cyber security. Building on our insights into requirements for cyber science and on lessons learned through 8 years of operation, we have made several transformative advances towards creating the next generation of DeterLab. These advances in experiment design and research methodology are yielding progressive improvements not only in experiment scale, complexity, diversity, and repeatability, but also in the ability of researchers to leverage prior experimental efforts of other researchers in the DeterLab user community. This paper describes the advances resulting in a new experimentation science and a transformed facility for cybersecurity research development and evaluation.


Blog Data Mining for Cyber Security Threats
http://ewinarko.staff.ugm.ac.id/datamining/tugas2/09-app-blogmining-cybersec.pdf
Blog data mining is a growing research area that addresses the domain-specific problem of extracting information from blog data. In our work, we analyzed blogs for various categories of cyber threats related to the detection of security threats and cyber crime. We have extended the Author-Topic model, based on Latent Dirichlet Allocation, to identify patterns of similarities in keywords and dates distributed across blog documents. From this model, we visualized the content and date similarities using the Isomap dimensionality reduction technique. Our findings support the theory that our probabilistic blog model can present the blogosphere in terms of topics with measurable keywords, hence aiding the investigative processes to understand and respond to critical cyber security events and threats.


Insider Threat Detection Using Graph-Based Approach
http://eecs.wsu.edu/~holder/pubs/EberleCATCH09.pdf

Protecting our nation's cyber infrastructure and securing sensitive information are critical challenges for homeland security and require the research, development and deployment of new technologies that can be transitioned into the field for combating cyber security risks. Particular areas of concern are the deliberate and intended actions associated with malicious exploitation, theft or destruction of data, or the compromise of networks, communications or other IT resources, of which the most harmful and difficult to detect threats are those propagated by an insider. However, current efforts to identify unauthorized access to information, such as what is found in document control and management systems, are limited in scope and capabilities.

In order to address this issue, this effort involves performing further research and development on the existing Graph-Based Anomaly Detection (GBAD) system [3]. GBAD discovers anomalous instances of structural patterns in data that represent entities, relationships and actions. Input to GBAD is a labeled graph in which entities are represented by labeled vertices and relationships or actions are represented by labeled edges between entities. Using the minimum description length (MDL) principle to identify the normative pattern that minimizes the number of bits needed to describe the input graph after being compressed by the pattern, GBAD implements algorithms for identifying the three possible changes to a graph: modifications, insertions and deletions. Each algorithm discovers those substructures that match the closest to the normative pattern without matching exactly. As a result, GBAD is looking for those activities that appear to match normal (or legitimate) transactions, but in fact are structurally different.

As a solution to the problem of insider threat detection, we will apply GBAD to datasets that represent the flow of information between entities, as well as the actions that take place on the information. This research involves the representation of datasets, like a document control and management system, as a graph, enhancement of GBAD's performance levels, and evaluation of GBAD on these datasets. In previous research, GBAD has already achieved over 95% accuracy detecting anomalies in simulated domains, with minimal false positives, on graphs of up to 100,000 vertices.
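A back-of-the-envelope sketch of the MDL intuition mentioned above (pattern bits plus compressed-graph bits); the description-length formula here is a crude stand-in for illustration, not GBAD's actual encoding.

```python
# A minimal sketch: the best normative substructure minimizes
# bits(pattern) + bits(graph compressed by the pattern); anomalies are then
# instances that almost, but not exactly, match that pattern.
import math

def bits(num_nodes, num_edges, num_labels):
    # crude description length: each node/edge costs log2(#labels) bits
    return (num_nodes + num_edges) * math.log2(max(2, num_labels))

def compressed_cost(graph_size, pattern_size, occurrences, num_labels):
    """graph_size and pattern_size are (nodes, edges) tuples."""
    g_n, g_e = graph_size
    p_n, p_e = pattern_size
    remaining_nodes = g_n - occurrences * (p_n - 1)   # each occurrence collapses to one super-node
    remaining_edges = g_e - occurrences * p_e
    return bits(p_n, p_e, num_labels) + bits(remaining_nodes, remaining_edges, num_labels)

# Example: in a 1000-node / 3000-edge graph, a 4-node / 4-edge pattern occurring
# 120 times compresses the graph far better than the same pattern occurring 3 times.
print(compressed_cost((1000, 3000), (4, 4), 120, num_labels=20))
print(compressed_cost((1000, 3000), (4, 4), 3, num_labels=20))
```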

Twitter + Lifelog

A location predictor based on dependencies between multiple lifelog data
http://dl.acm.org/citation.cfm?id=1867702


Towards trajectory-based experience sharing in a city
http://dl.acm.org/citation.cfm?id=2063221

Twitter as Economic Indicators


http://dealmakersguide.com/twitter-as-an-economic-indicator/
http://professional.wsj.com/article/SB10001424052970204138204576598942105167646.html


Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena
http://arxiv.org/pdf/0911.1583

Twitter

Twitter under the microscope - First Monday

http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2317/206
Scholars, advertisers and political activists see massive online social networks as a representation of social interactions that can be used to study the propagation of ideas, social bond dynamics and viral marketing, among others. But the linked structures of social networks do not reveal actual interactions among people. Scarcity of attention and the daily rhythms of life and work makes people default to interacting with those few that matter and that reciprocate their attention. A study of social interactions within Twitter reveals that the driver of usage is a sparse and hidden network of connections underlying the “declared” set of friends and followers.

Twitter + Influencer (Retweet)

Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network
http://www2.parc.com/isl/members/hong/publications/socialcomputing2010.pdf

Retweeting is the key mechanism for information diffusion in Twitter. It emerged as a simple yet powerful way of disseminating useful information. Even though a lot of information is shared via its social network structure in Twitter, little is known yet about how and why certain information spreads more widely than others. In this paper, we examine a number of features that might affect retweetability of tweets. We gathered content and contextual features from 74M tweets and used this data set to identify factors that are significantly associated with retweet rate. We also built a predictive retweet model. We found that, amongst content features, URLs and hashtags have strong relationships with retweetability. Amongst contextual features, the number of followers and followees as well as the age of the account seem to affect retweetability, while, interestingly, the number of past tweets does not predict retweetability of a user's tweet. We believe that this research would inform the design of sensemaking tools for …
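A minimal sketch of what a predictive retweet model over such features could look like in scikit-learn; the data below is synthetic and the feature set merely mirrors the ones named in the abstract, so nothing here reproduces the paper's model or results.

```python
# A minimal sketch: logistic regression over content features (URL, hashtag)
# and contextual features (followers, account age, past tweets).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([
    rng.integers(0, 2, n),              # has_url
    rng.integers(0, 2, n),              # has_hashtag
    np.log1p(rng.lognormal(5, 2, n)),   # log follower count
    rng.integers(1, 2000, n) / 2000,    # account age (scaled)
    rng.integers(0, 50000, n) / 50000,  # number of past tweets (scaled)
])
# synthetic label loosely tied to URLs, hashtags, and followers
logit = 1.2 * X[:, 0] + 0.8 * X[:, 1] + 0.3 * X[:, 2] - 3.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```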

Identifying Influencers on Twitter
http://thenoisychannel.com/2011/04/16/identifying-influencers-on-twitter/

More Twitter Analysis: Influencers Don't Retweet
http://www.readwriteweb.com/archives/more_twitter_analysis_influencers_dont_retweet.php

Measuring User Influence in Twitter: The Million Follower Fallacy
http://an.kaist.ac.kr/~mycha/docs/icwsm2010_cha.pdf


Tweet, Tweet, Retweet: Conversational Aspects of Retweeting on Twitter
http://research.microsoft.com/pubs/102168/TweetTweetRetweet.pdf

Twitter + Disasters

Twitter in Disaster Mode:
http://people.ee.ethz.ch/~hossmath/papers/woid_twimight.pdf

Recent natural disasters (earthquakes, floods, etc.) have shown that people heavily use platforms like Twitter to communicate and organize in emergencies. However, the fixed infrastructure supporting such communications may be temporarily wiped out. In such situations, the phones' capabilities of infrastructure-less communication can fill in: by propagating data opportunistically (from phone to phone), tweets can still be spread, yet at the cost of delays. In this paper, we present Twimight and its network security extensions. Twimight is an open source Twitter client for Android phones featured with a "disaster mode", which users enable upon losing connectivity. In the disaster mode, tweets are not sent to the Twitter server but stored on the phone, carried around as people move, and forwarded via Bluetooth when in proximity with other phones. However, switching from an online centralized application to a distributed and delay-tolerant service relying on opportunistic communication requires rethinking the security architecture. We propose security extensions to offer comparable security in the disaster mode as in the normal mode to protect Twimight from basic attacks. We also propose a simple, yet efficient, anti-spam scheme to avoid users from being flooded with spam. Finally, we present a preliminary empirical performance evaluation of Twimight.

Information Credibility on Twitter

Information Credibility on Twitter
http://www.www2011india.com/proceeding/proceedings/p675.pdf


We analyze the information credibility of news propagated through Twitter, a popular microblogging service. Previous research has shown that most of the messages posted on Twitter are truthful, but the service is also used to spread misinformation and false rumors, often unintentionally. In this paper we focus on automatic methods for assessing the credibility of a given set of tweets. Specifically, we analyze microblog postings related to "trending" topics, and classify them as credible or not credible, based on features extracted from them. We use features from users' posting and re-posting ("re-tweeting") behavior, from the text of the posts, and from citations to external sources. We evaluate our methods using a significant number of human assessments about the credibility of items on a recent sample of Twitter postings. Our results show that there are measurable differences in the way messages propagate, which can be used to classify them automatically as credible or not credible, with precision and recall in the range of 70% to 80%.

Twitter + Event Detection (2)

TwitterMonitor: Trend Detection over the Twitter Stream
http://queens.db.toronto.edu/~mathiou//TwitterMonitor.pdf

We present TwitterMonitor, a system that performs trend detection over the Twitter stream. The system identifies emerging topics (i.e. 'trends') on Twitter in real time and provides meaningful analytics that synthesize an accurate description of each topic. Users interact with the system by ordering the identified trends using different criteria and submitting their own description for each trend. We discuss the motivation for trend detection over social media streams and the challenges that lie therein. We then describe our approach to trend detection, as well as the architecture of TwitterMonitor. Finally, we lay out our demonstration scenario.
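For flavor, a simple burst-detection heuristic of the kind often used for trend detection (current-window keyword frequency versus a longer reference window); this is an assumption on my part, not TwitterMonitor's algorithm.

```python
# A minimal sketch: flag keywords whose rate in the current window is far above
# their rate in a reference window.
from collections import Counter

def bursty_keywords(current_window, reference_window, min_count=5, ratio=3.0):
    cur = Counter(w for tweet in current_window for w in tweet.lower().split())
    ref = Counter(w for tweet in reference_window for w in tweet.lower().split())
    cur_total = max(1, sum(cur.values()))
    ref_total = max(1, sum(ref.values()))
    trends = []
    for word, count in cur.items():
        if count < min_count:
            continue
        lift = (count / cur_total) / ((ref[word] + 1) / ref_total)  # +1 smoothing
        if lift >= ratio:
            trends.append((word, lift))
    return sorted(trends, key=lambda x: -x[1])
```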

Applying Graph Techniques to Code Analysis

http://oss.infoscience.co.jp/gephi/wiki.gephi.org/index.php/Datasets.html
[GEXF] Java code: the source-code structure of a Java program. By S. Heymann & J. Palmier, 2008.
[GEXF] Dynamic Java code: the dynamic source-code structure of a Java program as commits progress in SVN. By S. Heymann & J. Bilcke, 2008.
[TGZ] GitHub open-source developers: see http://lumberjaph.net/blog/index.php/2010/03/25/github-explorer/.

Smart Grid

http://oss.infoscience.co.jp/gephi/wiki.gephi.org/index.php/Datasets.html
Power grid: an unweighted, undirected network representing the topology of the Western States Power Grid of the United States. The data were compiled by D. Watts and S. Strogatz and are available on the Web. When citing, please use: D. J. Watts and S. H. Strogatz, Nature 393, 440-442 (1998).

Gephi

http://oss.infoscience.co.jp/gephi/gephi.org/plugins/graph-streaming/index.html
http://gephi.org/blog/

The purpose of the Graph Streaming API is to build a unified framework for streaming graph objects. Gephi’s data structure and visualization engine has been built with the idea that a graph is not static and might change continuously. By connecting Gephi with external data-sources, we leverage its power to visualize and monitor complex systems or enterprise data in real-time. Moreover, the idea of streaming graph data goes beyond Gephi, and a unified and standardized API could bring interoperability with other available tools for graph and network analysis, as they could start to interoperate with other tools in a distributed and cooperative fashion.
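A small sketch of pushing events to a Gephi master via the Graph Streaming plugin; the endpoint URL and JSON event format below follow the plugin's documented conventions as I recall them, so verify them against your Gephi version before relying on this.

```python
# A minimal sketch: POST add-node ("an") and add-edge ("ae") events to a running
# Gephi instance that has the Graph Streaming plugin enabled as a master.
import json
import urllib.request

GEPHI_URL = "http://localhost:8080/workspace0?operation=updateGraph"  # assumed default endpoint

def send_event(event):
    data = json.dumps(event).encode("utf-8")
    urllib.request.urlopen(urllib.request.Request(GEPHI_URL, data=data)).read()

send_event({"an": {"n1": {"label": "host-1"}}})            # add node n1
send_event({"an": {"n2": {"label": "host-2"}}})            # add node n2
send_event({"ae": {"e1": {"source": "n1", "target": "n2", "directed": False}}})  # add edge
```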

Data sets are available at
http://oss.infoscience.co.jp/gephi/wiki.gephi.org/index.php/Datasets.html

Twitter +

Twitter Power: Tweets as Electronic Word of Mouth
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.155.3321

In this paper we report research results investigating microblogging as a form of electronic word-of-mouth for sharing consumer opinions concerning brands. We analyzed more than 150,000 microblog postings containing branding comments, sentiments, and opinions.We investigated the overall structure of these microblog postings, the types of expressions, and the movement in positive or negative sentiment.We compared automated methods of classifying sentiment in these microblogs with manual coding. Using a case study approach, we analyzed the range, frequency, timing, and content of tweets in a corporate account. Our research findings show that 19% of microblogs contain mention of a brand. Of the branding microblogs, nearly 20 % contained some expression of brand sentiments. Of these, more than 50 % were positive and 33 % were critical of the company or product. Our comparison of automated and manual coding showed no significant differences between the two approaches. In analyzing microblogs for structure and composition, the linguistic structure of tweets approximate the linguistic patterns of natural language expressions. We find that microblogging is an online tool for customer word of mouth communications and discuss the implications for corporations using microblogging as part of their overall marketing strategy.