Notes on Expressibility of Vertex Centric Graph Processing Paradigm

A summary of the IEEE BigData 2018 research paper by Siyuan Liu and Arijit Khan

We study the vertex-centric (VC) paradigm for distributed graph processing, popularized by Google’s Pregel system [1]. In Pregel, which was inspired by the Bulk Synchronous Parallel (BSP) model [2], graph algorithms are expressed as a sequence of iterations called supersteps (Figure 1). Each superstep is an atomic unit of parallel computation. During a superstep, Pregel executes a user-defined function for each vertex in parallel. The user-defined function specifies the operation at a single vertex v and at a single superstep S. The supersteps are globally synchronous among all vertices, and messages are usually sent along the outgoing edges from each vertex. In 2012, Yahoo! launched the Apache Giraph [3] as an open-source project, which clones the concepts of Pregel. Since then, there were several attempts to implement many graph algorithms…

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

A summary of the USENIX ATC 2018 research paper by Arijit Khan, Gustavo Segovia, and Donald Kossmann

Background: Distributed Graph Querying & Graph Partitioning
Graphs with millions of nodes and billions of edges are ubiquitous to represent highly interconnected structures including the World Wide Web, social networks, knowledge graphs, genome and scientific databases, medical and government records. To support online search and query services (possibly from many clients) with low latency and high throughput, data centers and cloud operators consider scale-out solutions, in which the graph and its data are partitioned horizontally across cheap commodity servers. We assume that the graph topology and the data associated with nodes and edges are co-located, since they are often accessed together. Keeping with the modern database trends to support low-latency operations, we target a fully in-memory system, and…

Jointly Finding Seeds and Relevant Tags in Social Networks

Jointly Finding Seeds and Relevant Tags in Social Networks for Targeted Influence Maximization
A summary of the SIGMOD 2018 research paper by Xiangyu KE, Arijit KHAN and Gao CONG
[Influence Maximization]
Nowadays online social networks have become an important medium for campaigners to promote their products. It is quite normal that a social network has millions of users, and it is impossible for a campaigner to directly reach out to all of them via advertisements, free samples, or discounted prices. If we see our friends' tweets praising some products, we probably also adopt them. Therefore, it is important to identify a small set of users (with cardinality k) who can influence as many other users as possible in the social network. It is known as the Influence Maximization problem, and has been widely studied [1, 2]. Moreover, a campaigner can further specify her target customers, either explicitly, or via some constraints (e.g., people in age group 20~30), which is referred to as …
Probabilistic Entity Resolution with Imperfect Crowd: PERC
Introduction: In real world scenarios, we encounter many situations where two objects may not look exactly identical, but they represent the same entity.
Example 1: 1. Vijaya Krishna Yalavarthi  2. Yalavarthi Vijaya Krishna  3. Xiangyu Ke  4. Arijit Khan
In the above table, among the four names given, first two names represent the same person - Vijaya Krishna Yalavarthi, though they are not written in an identical way.
Example 2:

Among the three images given, A and B represent Taj Mahal (India), whereas C represents Merlion (Singapore). Though A and B do not look identical, they represent the same entity - Taj Mahal.
What is entity resolution? Entity resolution (ER) [2] is the task of linking and clustering of all records that represent the same real-world entity. ER, also known as record linkage or deduplication…