Multi-relation Graph Summarization
Background: Multi-relation Graphs
Multi-relation networks (also known
as multi-layer, multiplex, or multi-dimensional networks) are graphs where
multiple edges of different types may exist between any pair of nodes [7].
Multi-relation graphs are an expressive model of real-world activities, in which
a relation can be a topic in social networks, an interaction type in genetic
networks, or a snapshot in temporal graphs. For instance, BioGRID
(thebiogrid.org) describes seven different types of genetic interactions
between genes in Homo Sapiens. STRING (string-db.org) models protein-to-protein
interactions with six types of correlations statistically learned from existing
protein databases, revealing that most protein interactions are associated with
at least two types of correlations. Other applications where multiple relations
exist between entities include social and financial networks [1], urban and transportation
systems [2], ecology research [3], and recommender systems [4, 5, 6].
Multi-relation graph model is also attracting interest in graph analytics, such
as in shortest path finding [8], core decomposition and densest subgraph
discovery [1, 9], node clustering [10], frequent subgraphs mining [11], and in
social networks analysis [12].
Background: Graph Summary
Graph summarization is beneficial in a wide range of applications, such as visualization, interactive and exploratory analysis, approximate query processing, reducing the on-disk storage footprint, graph embedding, and graph processing in modern hardware [13, 14, 15, 16, 17]. Navlakha et al. [18] studied lossless summarization for simple undirected single-relation graphs. Given a graph, a lossless summary consists of a summary graph with supernodes and superedges, and a set of (positive/ negative) edge corrections, where given the (lossless) summary, we can reconstruct the original graph exactly, by (1) “exploding” each superedge, (2) adding each edge in positive edge corrections , and (3) removing the edges in negative edge corrections . An example of lossless summary is provided in Figure 1. The problem studied by Navlakha et al. is to find the smallest possible summary of an input graph, i.e., the one that minimizes the number of superedges and the number of edge corrections. To solve it, they proposed a simple greedy agglomerative heuristic without quality guarantee.
Figure 1: (a) A simple graph and (b) its lossless summary consisting of two supernodes, one superedge, and a negative correction edgeA similar approach was followed by LeFevre and Terzi [19] who studied summaries obtained by aggregating nodes into supernodes. However, the method keeps on each superedge the information about how many original edges it represents (but not which ones): this is clearly a lossy summarization. The objective is then to find the summary minimizing the loss for a given number ๐ of allowed supernodes (which implicitly controls the compression rate). The loss is represented by the reconstruction error, i.e., the difference between the original graph and the probabilistic graph that one can reconstruct form the summary. Similar to [18], [19] proposed a simple greedy agglomerative heuristic with no quality guarantee. Later, Riondato et al. [20, 21] proposed the first polynomial-time approximation algorithm for the problem of [19]. In this work we show that, following a similar intuition, we can achieve the first polynomial-time approximation algorithm for the classic lossless summarization problem of [18].
Our Problem: Multi-relation Graph Summary
In this paper, we study, for the
first time, the problem of multi-relation graph summarization. In particular,
we extend the notion of lossless summarization over multi-relation graphs. An undirected,
multi-relation graph ๐บ is a triplet (๐, ๐ธ, ๐
), where ๐ is a set of nodes, ๐
is a set of relations, and ๐ธ ⊆ ๐ ×๐ × ๐
is a set of undirected edges. Therefore, in a multi-relation graph, each
edge is a triplet: e.g., an edge between nodes ๐ข and ๐ฃ in relation ๐ ∈ ๐
is represented by (๐ข, ๐ฃ, ๐). A summary is defined as in the
single-relation case, the only difference is that the correction edges are
triplets and also superedges are now triplets. Our problem is to find the
smallest possible summary of an input graph, i.e., the one that minimizes the
number of superedges and the number of edge corrections.
Figure 2: (a) A multi-relation graph and (b) its lossless summary
Figure 2 provides an example of a
multi-relation graph and its summary. The graph on the left-hand side is
defined over 3 relations, contains 5 nodes and 16 edges. The summary on the right-hand
side is obtained by grouping {๐, ๐} and {๐, ๐} as two supernodes and keeping {๐} as a supernode over all 3 relations. No correction is required here.
The cost of such summary is thus 6 (given by 6 superedges + 0 correction),
while the cost of the original graph was 16 (i.e., 16 edges).
Why not keeping an individual summary for each relation? Figure 3 shows the optimal summary
for each relation of the multi-relation graph in Figure 2(a). One could argue
that storing these three individual summaries would also serve as a lossless summarization
for the given multi-relation graph. However, this is not a good option due to
two reasons: (1) The optimal multi-relation summary in Figure 2(b) provides us
more insights about the input network. For example, by only looking at the
individual summaries in Figure 3, we cannot easily determine the fact that the
nodes in set {๐, ๐} are fully connected with the nodes in set {๐, ๐} via all relations. However, the
nodes within each of these sets interact with themselves in different ways.
Thus, it is better to characterize them as two strongly connected supernodes
with different self-loops as in the multi-relation summary in Figure 2(b). (2)
More storage is required for maintaining all individual summaries, because each
individual summary might require storing of a different node mapping (i.e., the
mapping from nodes to supernodes).
Our Contributions
Our main contribution is to initiate
investigation into multi-relation graph summarization. Besides, our work
achieves the following contributions:
• We revise the classic
single-relation graph lossless summarization problem of [18], and provide the first
polynomial-time approximation algorithm.
• For multi-relation lossless graph
summarization, we design basic two-step algorithms, which first generate a
lossless summary for each relation, then properly aggregate them to obtain one
uniform summary. We also highlight the limits of this approach.
• We next propose holistic
algorithms for more compact and efficient summary generation. Our holistic ๐-Median+
algorithm maintains the same approximation guarantee for multi-relation graph
summarization.
• We combine the traditional Greedy
method [18] and the approximation soluton ๐-Median as the final proposed algorithm Hybrid, which empirically produces the most compact summary.
• Our empirical evaluation on four
real-world networks confirms that our holistic algorithms can produce more
compact summaries and are faster than the two-step approaches.
• Real-world applications on
visualization, classification, and query processing demonstrate the
effectiveness and efficiency of our proposal.
Case Study: Visualization on DBLP Collaboration Network
We visualize the summary of DBLP
dataset, and present some interesting case studies in Figure 4. In the first
case shown in Figure 4(a), we find that all researchers in each supernode
actively collaborate with others on the topics such as “data stream” and
“approximation”. They have both self-loops and links to each other with such
topics. However, a few different topics also appear within each group. For
example, Graham Cormode also has quite a few “private”-related (i.e., privacy) joint
works with Divesh Srivastava. Another interesting finding is that both the
left-side and the top supernodes contain researchers from database community,
since they have more publications in SIGMOD and VLDB, while the right-side
supernode consists of theory community researchers, who publish more in STOC,
SODA, etc.
Figure 4: Case study on DBLP
The second case in Figure 4(b)
presents a group of people with close collaboration. One can verify that they
are all from same geographical location, i.e., Hong Kong in this example. This
is a frequent pattern in DBLP.
Figure 4(c) first shows that very senior researchers tend to be kept as a single node, since they work on quite diverse topics and with many researchers. Here, Jiawei Han and his ex-students, Jian Pei and Xifeng Yan, are all kept as a singleton supernode, and Jiawei Han has different collaboration topics with them. His recently graduated student, Xiang Ren (in 2018), and some other current students are grouped together, since they collaborate frequently with each other on topics related to knowledge extraction. Such correlations across different relations and node set could not be immediately inferred if one keeps individual summaries for all relations separately.
References
[1] E. Galimberti, F. Bonchi, and F. Gullo. 2017. Core Decomposition and
Densest Subgraph in Multilayer Networks. In CIKM.
[2] A. Cardillo, J. Gรณmez-Gardeรฑes, M. Zanin, M. Romance, D. Papo, F.
del Pozo, and S. Boccaletti. 2012. Emergence of Network Features from
Multiplexity. CoRR abs/1212.2153 (2012).
[3] M. Stella, C. S. Andreazzi, S. Selakovic, A. Goudarzi, and A. Antonioni.
2017. Parasite Spreading in Spatial Ecological Multiplex Networks. J. Complex
Networks 5, 3 (2017), 486–511.
[4] B. Jin, C. Gao, X. He, D. Jin, and Y. Li. 2020. Multi-behavior
Recommendation with Graph Convolutional Networks. In SIGIR.
[5] M. Mao, J. Lu, G. Zhang, and J. Zhang. 2017. Multi-relational Social
Recommendations via Multigraph Ranking. IEEE Trans. Cybernetics 47, 12 (2017),
4049–4061.
[6] L. Xia, C. Huang, Y. Xu, P. Dai, X. Zhang, H. Yang, J. Pei, and L. Bo.
2021. Knowledge-enhanced Hierarchical Graph Transformer Network for
Multi-behavior Recommendation. In AAAI.
[7] M. E. Dickison, M. Magnani, and L. Rossi. 2016. Multilayer Social
Networks. Cambridge University Press.
[8] X. Zhang and T. รzsu. 2019. Correlation Constraint Shortest Path over
Large Multi-Relation Graphs. PVLDB 12, 5 (2019), 488–501.
[9] E. Galimberti, F. Bonchi, F. Gullo, and T. Lanciano. 2020. Core
Decomposition in Multilayer Networks: Theory, Algorithms, and Applications. ACM
Trans. Knowl. Discov. Data 14, 1 (2020), 11:1–11:40.
[10] B. Boden, S. Gรผnnemann, H. Hoffmann, and T. Seidl. 2012. Mining
Coherent Subgraphs in Multi-layer Graphs with Edge Labels. In KDD.
[11] X. Yan, X. J. Zhou, and J. Han. 2005. Mining Closed Relational
Graphs with Connectivity Constraints. In KDD.
[12] M. Coscia, G. Rossetti, D. Pennacchioli, D. Ceccarelli, and F.
Giannotti. 2013. "You Know because I Know": A Multidimensional Network
Approach to Human Resources Problem. In ASONAM.
[13] M. Besta and T. Hoefler. 2018. Survey and Taxonomy of Lossless
Graph Compression and Space-Efficient Graph Representations. CoRR
abs/1806.01799 (2018).
[14] A. Khan, S. S. Bhowmick, and F. Bonchi. 2017. Summarizing Static and
Dynamic Big Graphs. PVLDB 10, 12 (2017), 1981–1984.
[15] D. Koutra, J. Vreeken, and F. Bonchi. 2018. Summarizing Graphs at
Multiple Scales: New Trends. In ICDM.
[16] Y. Liu, T. Safavi, A. Dighe, and D. Koutra. 2018. Graph
Summarization Methods and Applications: A Survey. ACM Comput. Surv. 51, 3
(2018), 62:1–62:34.
[17] J. Yang, J. You, and X. Wan. 2021. Graph Embedding via Graph
Summarization. IEEE Access 9 (2021), 45163–45174.
[18] S. Navlakha, R. Rastogi, and N. Shrivastava. 2008. Graph
Summarization with Bounded Error. In SIGMOD.
[19] K. LeFevre and E. Terzi. 2010. GraSS: Graph Structure
Summarization. In SDM.
[20] M. Riondato, D. Garcรญa-Soriano, and F. Bonchi. 2014. Graph
Summarization with Quality Guarantees. In ICDM.
[21] M. Riondato, D. Garcรญa-Soriano, and F. Bonchi. 2017. Graph
Summarization with Quality Guarantees. Data Min. Knowl. Discov. 31, 2 (2017),
314–349.
Comments
Post a Comment