Posts

Distance-generalized Core Decomposition

Image
Distance-generalized Core Decomposition

A summary of the SIGMOD 2019 research paper by Francesco Bonchi, Arijit Khan, and Lorenzo Severini
Background: Extracting dense structures from large graphs has emerged as a key graph-mining primitive in a variety of application scenarios, ranging from web mining, to biology, and finance. Many different definitions of dense subgraphs have been proposed (e.g., cliques, n-cliques, n-clans, k-plexes, f-groups, n-clubs, lambda sets), but most of them are NP-hard or at least quadratic. In this respect, the concept of core decomposition is particularly appealing because (i) it can be computed in linear time, and (ii) it is related to many of the various definitions of a dense subgraph and it can be used to speed-up or approximate their computation. The k-core of a graph is defined as a maximal subgraph in which every vertex is connected to at least k other vertices within that subgraph. The set of all k-cores of a graph, for each k, forms its core decom…

In-Depth Comparison of st Reliability Algorithms over Uncertain Graphs

Image
An In-Depth Comparison of s-t Reliability Algorithms over Uncertain Graphs

A summary of the PVLDB 2019 research paper by Xiangyu Ke, Arijit Khan, and Leroy Lim Hong Quan

Uncertain, or probabilistic, graphs have been increasingly used to represent noisy linked data in many emerging applications, and have recently attracted the attention of the database research community [7]. A fundamental problem on uncertain graphs is the s-t reliability, which measures the probability that a target node t is reachable from a source node s in a probabilistic (or uncertain) graph, i.e., a graph where every edge is assigned a probability of existence.  This s - t reliability estimation has been used in many applications such as measuring the quality of connections between two terminals in a sensor network, finding other proteins that are highly probable to be connected with a specific protein in a protein-protein interaction (PPI) network, identifying highly reliable peers containing some file to transfe…
Image
Notes on Expressibility of Vertex Centric Graph Processing Paradigm

A summary of the IEEE BigData 2018 research paper by Siyuan Liu and Arijit Khan

We study the vertex-centric (VC) paradigm for distributed graph processing, popularized by Google’s Pregel system [1]. In Pregel, which was inspired by the Bulk Synchronous Parallel (BSP) model [2], graph algorithms are expressed as a sequence of iterations called supersteps (Figure 1). Each superstep is an atomic unit of parallel computation. During a superstep, Pregel executes a user-defined function for each vertex in parallel. The user-defined function specifies the operation at a single vertex v and at a single superstep S. The supersteps are globally synchronous among all vertices, and messages are usually sent along the outgoing edges from each vertex. In 2012, Yahoo! launched the Apache Giraph [3] as an open-source project, which clones the concepts of Pregel. Since then, there were several attempts to implement many graph algorithms…

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

Image
On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

A summary of the USENIX ATC 2018 research paper by Arijit Khan, Gustavo Segovia, and Donald Kossmann

Background: Distributed Graph Querying & Graph Partitioning
Graphs with millions of nodes and billions of edges are ubiquitous to represent highly interconnected structures including the World Wide Web, social networks, knowledge graphs, genome and scientific databases, medical and government records. To support online search and query services (possibly from many clients) with low latency and high throughput, data centers and cloud operators consider scale-out solutions, in which the graph and its data are partitioned horizontally across cheap commodity servers. We assume that the graph topology and the data associated with nodes and edges are co-located, since they are often accessed together. Keeping with the modern database trends to support low-latency operations, we target a fully in-memory system, and…

Jointly Finding Seeds and Relevant Tags in Social Networks

Image
Jointly Finding Seeds and Relevant Tags in Social Networks for Targeted Influence Maximization
A summary of the SIGMOD 2018 research paper by Xiangyu KE, Arijit KHAN and Gao CONG
[Influence Maximization]
Nowadays online social networks have become an important medium for campaigners to promote their products. It is quite normal that a social network has millions of users, and it is impossible for a campaigner to directly reach out to all of them via advertisements, free samples, or discounted prices. If we see our friends' tweets praising some products, we probably also adopt them. Therefore, it is important to identify a small set of users (with cardinality k) who can influence as many other users as possible in the social network. It is known as the Influence Maximization problem, and has been widely studied [1, 2]. Moreover, a campaigner can further specify her target customers, either explicitly, or via some constraints (e.g., people in age group 20~30), which is referred to as …
Image
Probabilistic Entity Resolution with Imperfect Crowd: PERC
A SUMMARY OF THE CIKM 2017 RESEARCH PAPER and VLDB 2018 DEMOnstration BY VIJAYA KRISHNA YALAVARTHI, XIANGYU KE, AND ARIJIT KHAN
Introduction: In real world scenarios, we encounter many situations where two objects may not look exactly identical, but they represent the same entity.
Example 1: 1. Vijaya Krishna Yalavarthi  2. Yalavarthi Vijaya Krishna  3. Xiangyu Ke  4. Arijit Khan
In the above table, among the four names given, first two names represent the same person - Vijaya Krishna Yalavarthi, though they are not written in an identical way.
Example 2:


Among the three images given, A and B represent Taj Mahal (India), whereas C represents Merlion (Singapore). Though A and B do not look identical, they represent the same entity - Taj Mahal.
What is entity resolution? Entity resolution (ER) [2] is the task of linking and clustering of all records that represent the same real-world entity. ER, also known as record linkage or deduplication…