Most Probable Densest Subgraphs

A summary of the IEEE International Conference on Data Engineering (ICDE) 2023 research paper by Arkaprava Saha, Xiangyu Ke, Arijit Khan, and Cheng Long

Background: The discovery of dense subgraphs has attracted extensive attention in the data management community [1], [2], [3], [4], [5].They may correspond to communities [6], filter bubbles and echo chambers [7], [8] in social networks, brain regions responding to stimuli [9] or related to diseases [10], and commercial value motifs in financial domains [11]. They also have wide applications in graph compression and visualization [12], [13], [14], indexing for reachability and distance queries [15], [16], and social piggybacking [17]. Densest subgraphs usually maximize some notion of density in a given graph, e.g., the edge density [1], defined as the ratio of the number of induced edges to the number of nodes in a subgraph. Although there are an exponential number of subgraphs, a densest subgraph can be found both exactly and approximately in polynomial time [1], [2]. There also exist many other density metrics [18], such as the edge ratio, edge surplus, triangle, clique, and pattern density, etc. [19], [20], [21], [5]. For surveys and tutorials, we refer to [22], [4].

Surprisingly, except for maximum expected edge density [23], [33], the study of densest subgraph discovery on uncertain graphs is still absent. The expected edge density of an uncertain graph is defined as the expectation of the edge density value of a possible world (i.e., a deterministic graph) of the uncertain graph, chosen at random [23]. However, a subgraph of the uncertain graph having the maximum expected edge density may induce densest subgraphs only in a few (even zero) possible worlds of that uncertain graph. Such a subgraph can be large with many low-probability edges, or having nodes that are loosely connected (see Example 1). This defeats the purpose of finding a densest subgraph. Instead, many applications would require a densest subgraph with a high precision, such as being the densest with a high probability. 

Example 1. Figure 1 shows all possible worlds of an uncertain graph with their existence probabilities. It can be verified that, in each world, the connected component is also the densest subgraph. As Table I shows, the node set {A,B,C,D} has the maximum expected density (0.38), but it induces a densest subgraph only in possible worlds G7 and G8 with low existence probabilities (0.168 and 0.112). Thus, the probability of {A,B,C,D} inducing a densest subgraph is only 0.28. In contrast, the node set {B,D} has a slightly lower expected density (0.35), but its probability of inducing a densest subgraph is much higher (0.42), since it induces a densest subgraph in possible worlds G4 and G7 with high existence probabilities.

The Problems Studied: Based on edge density, clique density, and pattern density [5], we propose densest subgraph probability as a more sophisticated density metric over uncertain graphs. Given an uncertain graph G, our goal is to find the node set that is the most likely to induce a densest subgraph in G, i.e., maximize the sum of the probabilities of those possible worlds of G in which this node set induces a densest subgraph. We refer to the uncertain subgraph induced by this node set as the most probable densest subgraph (MPDS). We formulate and study the following novel problems of MPDS discovery in uncertain graphs: MPDS based on edge density, clique density, and pattern density; for each of them, their top-k variants.

In large graphs, we find that the densest subgraph probability of every possible node set may be quite small, due to the existence of many possible worlds, each having a smaller probability; and any two worlds might not have exactly the same densest subgraph. This defeats our purpose of identifying a node set that induces a densest subgraph with a high probability. In such cases, we propose to find those nodes which are most likely to form the “nucleus” of various densest subgraphs, i.e., whose containment probability within a densest subgraph is maximized. We consider nucleus densest subgraphs (NDS) variants of our MPDS problems.

Our Solution Framework: We prove that computing the densest subgraph probability is #P-hard. In spite of the #P-hardness of computing the densest subgraph probability, we design an efficient approximation algorithm for the top-k densest subgraphs discovery, with an end-to-end accuracy guarantee. Our solution for edge density-based MPDS is built on independent sampling of possible worlds (e.g., via Monte-Carlo sampling) and, in each of them, efficient enumeration of all edge-densest subgraphs (via [24]). We provide time and space complexity analyses and theoretical accuracy guarantees of our method.

Our algorithm can adapt well to clique and pattern densities while ensuring the same accuracy guarantee. Although there exist efficient algorithms to find one clique-densest and one pattern-densest subgraph in a deterministic graph [5], the problems of enumerating all clique- and pattern-densest subgraphs in a deterministic graph have not been studied earlier. However, such enumerations are required in our solution framework. Thus, as additional technical contributions, we develop novel, exact algorithms for efficiently enumerating all clique- and pattern densest subgraphs in a deterministic graph.

For nucleus densest subgraph (NDS) problems, we develop an approximate solution and present theoretical analyses about its accuracy-efficiency trade-offs. The novelty of our solution is that, by finding the maximum sized densest subgraph in each sampled world, we reduce this problem to the closed frequent itemset mining problem, for which efficient algorithms like TFP [25] exist.

Case Study: Our case studies on brain and social networks demonstrate useful real-world applications of the MPDS and NDS. A brain network can be defined as an uncertain graph where nodes are brain regions of interest (ROIs), an edge indicates co-activation between two ROIs, and an edge probability indicates the strength of the co-activation signal. Dense subgraphs in brain networks can represent brain regions responding together to stimuli [9] or related to diseases [10].

The dataset we use [26] contains data of 52 Typically Developed (TD) children and 49 children suffering from Autism Spectrum Disorder (ASD). Each subject is represented as a graph over 116 nodes (ROIs). GASD and GTD are uncertain graphs, defined over the same set of nodes as the original ones, while the probability of each edge is the average of those of the same edge across all graphs in the ASD and TD groups. Using BrainNet Viewer [27], we show the 3-clique MPDSs for both GTD and GASD in Figures 2 and 3. The MPDS in GASD lies entirely in the occipital lobe, in contrast to that in GTD, which also contains one node in the temporal lobe and one in the cerebellum. Besides, the MPDS in GASD is more symmetrical than that in GTD, since the former has only one node (MOG.R) without its counterpart in the other hemisphere, while the latter has two more such nodes (CRBL6.L and FFG.R). This is consistent with the results of different works in neuroscience indicating that, in contrast to typically developed brains, those affected by ASD are characterized by under-connectivity between distant brain regions and over-connectivity between closer ones [28], [29], and that the hemispheres of ASD-affected brains are more symmetrical than those of typically developed ones [30]. Our consistent findings underline the importance of finding MPDSs in uncertain brain networks that can differentiate healthy and autistic brains. We also show in our paper that the expected densest subgraph [23], probabilistic innermost core [31], and probabilistic innermost truss [32] cannot well-characterize and distinguish autistic brains.


For more information, click here for our 
paper.

References

[1] A. V. Goldberg, Finding a Maximum Density Subgraph. University of California Berkeley, 1984.

[2] M. Charikar, “Greedy approximation algorithms for finding dense components in a graph,” in International Workshop on Approximation Algorithms for Combinatorial Optimization. Berlin, Heidelberg: Springer, 2000, pp. 84–95.

[3] C. Tsourakakis, F. Bonchi, A. Gionis, F. Gullo, and M. Tsiarli, “Denser than the densest subgraph: Extracting optimal quasi-cliques with quality guarantees,” in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, p. 104–112.

[4] A. Gionis and C. E. Tsourakakis, “Dense subgraph discovery,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, p. 2313–2314.

[5] Y. Fang, K. Yu, R. Cheng, L. V. Lakshmanan, and X. Lin, “Efficient algorithms for densest subgraph discovery,” Proceedings of the VLDB Endowment, vol. 12, no. 11, p. 1719–1732, 2019.

[6] Y. Dourisboure, F. Geraci, and M. Pellegrini, “Extraction and classification of dense implicit communities in the web graph,” ACM Transactions on the Web, vol. 3, no. 2, pp. 1–36, 2009.

[7] K. Asatani, H. Yamano, T. Sakaki, and I. Sakata, “Dense and influential core promotion of daily viral information spread in political echo chambers,” Scientific reports, vol. 11, no. 1, pp. 1–10, 2021.

[8] L. V. Lakshmanan, “On a quest for combating filter bubbles and misinformation,” in Proceedings of the 2022 ACM SIGMOD International Conference on Management of Data, 2022, p. 2.

[9] R. Legenstein, W. Maass, C. H. Papadimitriou, and S. S. Vempala, “Long term memory and the densest k-subgraph problem,” in 9th Innovations in Theoretical Computer Science Conference (ITCS 2018), ser. Leibniz International Proceedings in Informatics (LIPIcs), vol. 94. Dagstuhl, Germany: Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2018, pp. 57:1–57:15.

[10] Q. Wu, X. Huang, A. J. Culbreth, J. A. Waltz, L. E. Hong, and S. Chen, “Extracting brain disease-related connectome subgraphs by adaptive dense subgraph discovery,” Biometrics, 2021.

[11] X. Du, R. Jin, L. Ding, V. E. Lee, and J. H. Thornton, “Migration motif: A spatial-temporal pattern mining approach for financial markets,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, p. 1135–1144.

[12] G. Buehrer and K. Chellapilla, “A scalable pattern mining approach to web graph compression with communities,” in Proceedings of the 2008 International Conference on Web Search and Data Mining, 2008, p. 95–106.

[13] Y. Zhang and S. Parthasarathy, “Extracting analyzing and visualizing triangle k-core motifs within networks,” in IEEE 28th international conference on data engineering, 2012, pp. 1049–1060.

[14] F. Zhao and A. K. H. Tung, “Large scale cohesive subgraphs discovery for social network visual analysis,” Proceedings of the VLDB Endowment, vol. 6, no. 2, p. 85–96, 2012.

[15] E. Cohen, E. Halperin, H. Kaplan, and U. Zwick, “Reachability and distance queries via 2-hop labels,” in Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms. USA: Society for Industrial and Applied Mathematics, 2002, p. 937–946.

[16] R. Jin, Y. Xiang, N. Ruan, and D. Fuhry, “3-hop: A high-compression indexing scheme for reachability query,” in Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, 2009, p. 813–826.

[17] A. Gionis, F. Junqueira, V. Leroy, M. Serafini, and I. Weber, “Piggybacking on social networks,” Proceedings of the VLDB Endowment, vol. 6, no. 6, p. 409–420, 2013.

[18] A. Farago and Z. R. Mojaveri, “In search of the densest subgraph,” Algorithms, vol. 12, no. 8, p. 157, 2019.

[19] C. E. Tsourakakis, “The k-clique densest subgraph problem,” in Proceedings of the 24th International Conference on World Wide Web. Republic and Canton of Geneva, CHE: International World Wide Web Conferences Steering Committee, 2015, p. 1122–1132.

[20] M. Mitzenmacher, J. Pachocki, R. Peng, C. E. Tsourakakis, and S. C. Xu, “Scalable large near-clique detection in large-scale networks via sampling,” in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015, p. 815–824.

[21] H. Yanagisawa and S. Hara, “Discounted average degree density metric and new algorithms for the densest subgraph problem,” Networks, vol. 71, no. 1, pp. 3–15, 2018.

[22] Y. Fang, W. Luo, and C. Ma, “Densest subgraph discovery on large graphs: Applications, challenges, and techniques,” Proc. VLDB Endow., vol. 15, no. 12, pp. 3766–3769, 2022.

[23] Z. Zou, “Polynomial-time algorithm for finding densest subgraphs in uncertain graphs,” in Proceedings of the 11th Workshop on Mining and Learning with Graphs, 2013.

[24] L. Chang and M. Qiao, “Deconstruct densest subgraphs,” in Proceedings of the 29th International Conference on World Wide Web, 2020, p. 2747–2753.

[25] J. Wang, J. Han, Y. Lu, and P. Tzvetkov, “Tfp: an efficient algorithm for mining top-k frequent closed itemsets,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 5, pp. 652–663, 2005.

[26] C. Craddock, Y. Benhajali, C. Chu, F. Chouinard, A. Evans, A. Jakab, B. S. Khundrakpam, J. D. Lewis, Q. Li, M. Milham, C. Yan, and P. Bellec, “The neuro bureau preprocessing initiative: Open sharing of preprocessed neuroimaging data and derivatives,” Frontiers in Neuroinformatics, 2013.

[27] M. Xia, J. Wang, and Y. He, “Brainnet viewer: a network visualization tool for human brain connectomics,” PloS one, vol. 8, no. 7, p. e68910, 2013.

[28] A. Di Martino, C. Kelly, R. Grzadzinski, X.-N. Zuo, M. Mennes, M. Mairena, C. Lord, F. Castellanos, and M. Milham, “Aberrant striatal functional connectivity in children with autism,” Biological Psychiatry, vol. 69, no. 9, pp. 847–56, 12 2010.

[29] S. Noonan, F. Haist, and R.-A. Muller, “Aberrant functional connectivity in autism: Evidence from low-frequency bold signal fluctuations,” Brain Research, vol. 1262, pp. 48–63, 02 2009.

[30] M. Postema, D. Van Rooij, E. Anagnostou, C. Arango, G. Auzias, M. Behrmann, G. Busatto, S. Calderoni, R. Calvo, E. Daly, C. Deruelle, A. Di Martino, I. Dinstein, F. Duran, S. Durston, C. Ecker, S. Ehrlich, D. Fair, J. Fedor, and C. Francks, “Altered structural brain asymmetry in autism spectrum disorder in a study of 54 datasets,” Nature Commu- nications, vol. 10, 12 2019.

[31] F. Bonchi, F. Gullo, A. Kaltenbrunner, and Y. Volkovich, “Core decomposition of uncertain graphs,” in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, p. 1316–1325.

[32] X. Huang, W. Lu, and L. V. Lakshmanan, “Truss decomposition of probabilistic graphs: Semantics and algorithms,” in Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data, 2016, p. 77–90.

[33] C. E. Tsourakakis, T. Chen, N. Kakimura, and J. Pachocki, “Novel dense subgraph discovery primitives: Risk aversion and exclusion queries,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2020, pp. 378–394.

 

Comments

Popular posts from this blog

Measurements, Analyses, and Insights on the Entire Ethereum Blockchain Network