Big Graph Live

Multi-relation Graph Summarization

A summary of the ACM Transactions on Knowledge Discovery from Data (TKDD) Journal 2021 research paper by Xiangyu Ke, Arijit Khan, and Francesco Bonchi

Background: Multi-relation Graphs

Multi-relation networks (also known as multi-layer, multiplex, or multi-dimensional networks) are graphs where multiple edges of different types may exist between any pair of nodes [7]. Multi-relation graphs are an expressive model of real-world activities, in which a relation can be a topic in social networks, an interaction type in genetic networks, or a snapshot in temporal graphs. For instance, BioGRID (thebiogrid.org) describes seven different types of genetic interactions between genes in Homo Sapiens. STRING (string-db.org) models protein-to-protein interactions with six types of correlations statistically learned from existing protein databases, revealing that most protein interactions are associated with at least two types of correlations. Other applications where multiple relations exist between entities include social and financial networks [1], urban and transportation systems [2], ecology research [3], and recommender systems [4, 5, 6]. Multi-relation graph model is also attracting interest in graph analytics, such as in shortest path finding [8], core decomposition and densest subgraph discovery [1, 9], node clustering [10], frequent subgraphs mining [11], and in social networks analysis [12].

Background: Graph Summary

Graph summarization is beneficial in a wide range of applications, such as visualization, interactive and exploratory analysis, approximate query processing, reducing the on-disk storage footprint, graph embedding, and graph processing in modern hardware [13, 14, 15, 16, 17]. Navlakha et al. [18] studied lossless summarization for simple undirected single-relation graphs. Given a graph, a lossless summary consists of a summary graph with supernodes and superedges, and a set of (positive/ negative) edge corrections, where given the (lossless) summary, we can reconstruct the original graph exactly, by (1) “exploding” each superedge, (2) adding each edge in positive edge corrections , and (3) removing the edges in negative edge corrections . An example of lossless summary is provided in Figure 1. The problem studied by Navlakha et al. is to find the smallest possible summary of an input graph, i.e., the one that minimizes the number of superedges and the number of edge corrections. To solve it, they proposed a simple greedy agglomerative heuristic without quality guarantee.

Figure 1: (a) A simple graph and (b) its lossless summary consisting of two supernodes, one superedge, and a negative correction edge

A similar approach was followed by LeFevre and Terzi [19] who studied summaries obtained by aggregating nodes into supernodes. However, the method keeps on each superedge the information about how many original edges it represents (but not which ones): this is clearly a lossy summarization. The objective is then to find the summary minimizing the loss for a given number 𝑘 of allowed supernodes (which implicitly controls the compression rate). The loss is represented by the reconstruction error, i.e., the difference between the original graph and the probabilistic graph that one can reconstruct form the summary. Similar to [18], [19] proposed a simple greedy agglomerative heuristic with no quality guarantee. Later, Riondato et al. [20, 21] proposed the first polynomial-time approximation algorithm for the problem of [19]. In this work we show that, following a similar intuition, we can achieve the first polynomial-time approximation algorithm for the classic lossless summarization problem of [18].

Our Problem: Multi-relation Graph Summary

In this paper, we study, for the first time, the problem of multi-relation graph summarization. In particular, we extend the notion of lossless summarization over multi-relation graphs. An undirected, multi-relation graph 𝐺 is a triplet (𝑉, 𝐸, 𝑅), where 𝑉 is a set of nodes, 𝑅 is a set of relations, and 𝐸 ⊆ 𝑉 ×𝑉 × 𝑅 is a set of undirected edges. Therefore, in a multi-relation graph, each edge is a triplet: e.g., an edge between nodes 𝑢 and 𝑣 in relation 𝑟 ∈ 𝑅 is represented by (𝑢, 𝑣, 𝑟). A summary is defined as in the single-relation case, the only difference is that the correction edges are triplets and also superedges are now triplets. Our problem is to find the smallest possible summary of an input graph, i.e., the one that minimizes the number of superedges and the number of edge corrections.

Figure 2: (a) A multi-relation graph and (b) its lossless summary

Figure 2 provides an example of a multi-relation graph and its summary. The graph on the left-hand side is defined over 3 relations, contains 5 nodes and 16 edges. The summary on the right-hand side is obtained by grouping {𝑎, 𝑐} and {𝑏, 𝑑} as two supernodes and keeping {𝑒} as a supernode over all 3 relations. No correction is required here. The cost of such summary is thus 6 (given by 6 superedges + 0 correction), while the cost of the original graph was 16 (i.e., 16 edges).

Figure 3: The individual summary for each relation in Figure 2(a)

Why not keeping an individual summary for each relation? Figure 3 shows the optimal summary for each relation of the multi-relation graph in Figure 2(a). One could argue that storing these three individual summaries would also serve as a lossless summarization for the given multi-relation graph. However, this is not a good option due to two reasons: (1) The optimal multi-relation summary in Figure 2(b) provides us more insights about the input network. For example, by only looking at the individual summaries in Figure 3, we cannot easily determine the fact that the nodes in set {𝑎, 𝑐} are fully connected with the nodes in set {𝑏, 𝑑} via all relations. However, the nodes within each of these sets interact with themselves in different ways. Thus, it is better to characterize them as two strongly connected supernodes with different self-loops as in the multi-relation summary in Figure 2(b). (2) More storage is required for maintaining all individual summaries, because each individual summary might require storing of a different node mapping (i.e., the mapping from nodes to supernodes).

Our Contributions

Our main contribution is to initiate investigation into multi-relation graph summarization. Besides, our work achieves the following contributions:

• We revise the classic single-relation graph lossless summarization problem of [18], and provide the first polynomial-time approximation algorithm.

• For multi-relation lossless graph summarization, we design basic two-step algorithms, which first generate a lossless summary for each relation, then properly aggregate them to obtain one uniform summary. We also highlight the limits of this approach.

• We next propose holistic algorithms for more compact and efficient summary generation. Our holistic 𝑘-Median+ algorithm maintains the same approximation guarantee for multi-relation graph summarization.

• We combine the traditional Greedy method [18] and the approximation soluton 𝑘-Median as the final proposed algorithm Hybrid, which empirically produces the most compact summary.

• Our empirical evaluation on four real-world networks confirms that our holistic algorithms can produce more compact summaries and are faster than the two-step approaches.

• Real-world applications on visualization, classification, and query processing demonstrate the effectiveness and efficiency of our proposal.

Case Study: Visualization on DBLP Collaboration Network

We visualize the summary of DBLP dataset, and present some interesting case studies in Figure 4. In the first case shown in Figure 4(a), we find that all researchers in each supernode actively collaborate with others on the topics such as “data stream” and “approximation”. They have both self-loops and links to each other with such topics. However, a few different topics also appear within each group. For example, Graham Cormode also has quite a few “private”-related (i.e., privacy) joint works with Divesh Srivastava. Another interesting finding is that both the left-side and the top supernodes contain researchers from database community, since they have more publications in SIGMOD and VLDB, while the right-side supernode consists of theory community researchers, who publish more in STOC, SODA, etc.

Figure 4: Case study on DBLP

The second case in Figure 4(b) presents a group of people with close collaboration. One can verify that they are all from same geographical location, i.e., Hong Kong in this example. This is a frequent pattern in DBLP.

Figure 4(c) first shows that very senior researchers tend to be kept as a single node, since they work on quite diverse topics and with many researchers. Here, Jiawei Han and his ex-students, Jian Pei and Xifeng Yan, are all kept as a singleton supernode, and Jiawei Han has different collaboration topics with them. His recently graduated student, Xiang Ren (in 2018), and some other current students are grouped together, since they collaborate frequently with each other on topics related to knowledge extraction. Such correlations across different relations and node set could not be immediately inferred if one keeps individual summaries for all relations separately.

References

[1] E. Galimberti, F. Bonchi, and F. Gullo. 2017. Core Decomposition and Densest Subgraph in Multilayer Networks. In CIKM.

[2] A. Cardillo, J. Gómez-Gardeñes, M. Zanin, M. Romance, D. Papo, F. del Pozo, and S. Boccaletti. 2012. Emergence of Network Features from Multiplexity. CoRR abs/1212.2153 (2012).

[3] M. Stella, C. S. Andreazzi, S. Selakovic, A. Goudarzi, and A. Antonioni. 2017. Parasite Spreading in Spatial Ecological Multiplex Networks. J. Complex Networks 5, 3 (2017), 486–511.

[4] B. Jin, C. Gao, X. He, D. Jin, and Y. Li. 2020. Multi-behavior Recommendation with Graph Convolutional Networks. In SIGIR.

[5] M. Mao, J. Lu, G. Zhang, and J. Zhang. 2017. Multi-relational Social Recommendations via Multigraph Ranking. IEEE Trans. Cybernetics 47, 12 (2017), 4049–4061.

[6] L. Xia, C. Huang, Y. Xu, P. Dai, X. Zhang, H. Yang, J. Pei, and L. Bo. 2021. Knowledge-enhanced Hierarchical Graph Transformer Network for Multi-behavior Recommendation. In AAAI.

[7] M. E. Dickison, M. Magnani, and L. Rossi. 2016. Multilayer Social Networks. Cambridge University Press.

[8] X. Zhang and T. Özsu. 2019. Correlation Constraint Shortest Path over Large Multi-Relation Graphs. PVLDB 12, 5 (2019), 488–501.

[9] E. Galimberti, F. Bonchi, F. Gullo, and T. Lanciano. 2020. Core Decomposition in Multilayer Networks: Theory, Algorithms, and Applications. ACM Trans. Knowl. Discov. Data 14, 1 (2020), 11:1–11:40.

[10] B. Boden, S. Günnemann, H. Hoffmann, and T. Seidl. 2012. Mining Coherent Subgraphs in Multi-layer Graphs with Edge Labels. In KDD.

[11] X. Yan, X. J. Zhou, and J. Han. 2005. Mining Closed Relational Graphs with Connectivity Constraints. In KDD.

[12] M. Coscia, G. Rossetti, D. Pennacchioli, D. Ceccarelli, and F. Giannotti. 2013. "You Know because I Know": A Multidimensional Network Approach to Human Resources Problem. In ASONAM.

[13] M. Besta and T. Hoefler. 2018. Survey and Taxonomy of Lossless Graph Compression and Space-Efficient Graph Representations. CoRR abs/1806.01799 (2018).

[14] A. Khan, S. S. Bhowmick, and F. Bonchi. 2017. Summarizing Static and Dynamic Big Graphs. PVLDB 10, 12 (2017), 1981–1984.

[15] D. Koutra, J. Vreeken, and F. Bonchi. 2018. Summarizing Graphs at Multiple Scales: New Trends. In ICDM.

[16] Y. Liu, T. Safavi, A. Dighe, and D. Koutra. 2018. Graph Summarization Methods and Applications: A Survey. ACM Comput. Surv. 51, 3 (2018), 62:1–62:34.

[17] J. Yang, J. You, and X. Wan. 2021. Graph Embedding via Graph Summarization. IEEE Access 9 (2021), 45163–45174.

[18] S. Navlakha, R. Rastogi, and N. Shrivastava. 2008. Graph Summarization with Bounded Error. In SIGMOD.

[19] K. LeFevre and E. Terzi. 2010. GraSS: Graph Structure Summarization. In SDM.

[20] M. Riondato, D. García-Soriano, and F. Bonchi. 2014. Graph Summarization with Quality Guarantees. In ICDM.

[21] M. Riondato, D. García-Soriano, and F. Bonchi. 2017. Graph Summarization with Quality Guarantees. Data Min. Knowl. Discov. 31, 2 (2017), 314–349.

Search This Blog

Big Graph Live

Comments

Post a Comment

Popular posts from this blog

Measurements, Analyses, and Insights on the Entire Ethereum Blockchain Network

Archive