Distributed Graph Embedding with Information-Oriented Random Walks

A summary of the PVLDB 2023 research paper by Peng Fang, Arijit Khan, Siqiang Luo, Fang Wang, Dan Feng, Zhenli Li, Wei Yin, and Yuchao Cao

Background [Graph Embedding]: Graph embedding maps graph nodes to low-dimensional vectors and is widely adopted in machine learning tasks such as link prediction [1], node classification [2], clustering [3], and recommendation [4]. Sequential graph embedding techniques [5] fall into three categories. (1) Matrix factorization-based algorithms [6, 7, 8, 9, 10] construct feature representations based on the adjacency or Laplacian matrix and involve spectral techniques [11]. (2) Graph neural network (GNN)-based approaches [12, 13, 14, 15, 16] focus on generalizing graph spectra into semi-supervised or supervised graph learning. Both techniques incur high computational overhead and DRAM dependencies, limiting their scalability to large graphs. (3) A plethora of random walk-based graph embedding solutions [17, 18, 19, 20, 21] have also been proposed. A random walk is a graph traversal that starts from a source node, jumps to a neighboring node at each step, and stops after a few steps. Random walk-based embeddings are inspired by the well-known natural language processing model word2vec [22]. By conducting sufficient random walks on a graph, substantial structural information is collected and fed into word2vec (Skip-Gram) to generate node embeddings. Compared with other graph embedding solutions such as graph neural networks and matrix factorization techniques, random walk-based methods are more flexible, parallel-friendly, and scalable to larger graphs [23].
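As a minimal, DeepWalk-style sketch of this pipeline (not DistGER itself): unbiased random walks are generated over a toy adjacency list and fed as "sentences" to the Skip-Gram model in gensim 4.x. The toy graph, walk parameters, and helper names below are illustrative assumptions.

```python
import random
from gensim.models import Word2Vec   # gensim 4.x Skip-Gram implementation

def random_walks(adj, walk_len=80, walks_per_node=10, seed=42):
    """Plain unbiased random walks over an adjacency-list dict {node: [neighbors]}."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in adj:
            walk = [start]
            while len(walk) < walk_len and adj[walk[-1]]:
                walk.append(rng.choice(adj[walk[-1]]))
            walks.append([str(v) for v in walk])     # word2vec expects string tokens
    return walks

# toy graph for illustration only
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
model = Word2Vec(random_walks(adj), vector_size=64, window=10, sg=1,
                 negative=5, min_count=0, workers=4, epochs=5)
emb = model.wv[str(0)]   # 64-dimensional embedding of node 0
```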

Motivation [Scalable Graph Embedding]: The increasing availability of billion-edge graphs underscores the importance of learning efficient and effective embeddings on large graphs. For instance, the Twitter graph includes over 41 million user nodes and over one billion edges, and it has extensive requirements for link prediction and classification tasks [24]. The graph of users and products at Alibaba also consists of more than two billion user-product edges, forming a giant bipartite graph for its recommendation tasks [25]. The inherent challenge is that the number of random walks required increases with the graph size. For example, one representative method, node2vec [18], needs to sample many node pairs to ensure embedding quality; it takes months to learn node embeddings for a graph with 100 million nodes and 500 million edges using 20 threads on a modern server [10]. More recent work, e.g., HuGE [17], attempts to improve the quality of random walks according to the importance of nodes. Though this method removes redundant random walks to a great extent, the inherent complexity remains similar: it still requires more than one week to learn embeddings for a billion-edge Twitter graph on a modern server, hindering its adoption in real-world applications.

Another line of work turns to GPUs for efficient graph embedding. For example, some recent graph embedding frameworks (e.g., [27, 28]) perform graph random walks on CPUs and embedding training on GPUs simultaneously. However, because the computing power of GPUs and CPUs differs widely, the random walk procedure on CPUs typically cannot keep up with the embedding computation on GPUs, causing bottlenecks [29, 30]. Furthermore, this approach depends heavily on the GPUs' compute and memory capacity, which can be drastically different across servers.

Recently, distributed graph embedding, i.e., computing graph embeddings with multiple machines, has attracted significant research interest as a way to address the scalability issue. Examples include KnightKing [31], Pytorch-BigGraph [32], and DistDGL [26]. KnightKing [31] optimizes the walk-forwarding process for node2vec and brings several orders of magnitude improvement over a single-server solution. However, it may suffer from redundant or insufficient random walks due to its routine random walk setting (usually, walk length 𝐿 = 80 and 𝑟 = 10 walks per node), resulting in low-quality training information for the downstream task [17]. Moreover, the workload-balancing graph partitioning scheme it leverages fails to consider the randomness inherent in random walks, introducing higher communication costs across machines and degrading its performance. Facebook's Pytorch-BigGraph [32] leverages graph partitioning and a parameter server to learn large-graph embeddings on multiple CPUs on top of PyTorch. However, its parameter server needs to synchronize embeddings with clients, which puts more load on the communication network and limits scalability. Amazon has recently released DistDGL [26], a distributed embedding framework for graph neural network models. However, its efficiency is bogged down by the graph sampling operation, e.g., more than 80% of the overhead is spent on sampling in the GraphSAGE model [12], and its mini-batch sampling may delay gradient updates, causing inefficient synchronization. In conclusion, although distributed computation frameworks have shown better performance than single-server and CPU-GPU-based solutions, significant room exists for further improvement.

Our Proposed DistGER System: We present a newly designed distributed graph embedding system, DistGER (Figure 1), which incorporates more effective information-centric random walks, such as HuGE [17], and achieves super-fast graph embedding compared to the state of the art. Instead of generating walks with a routine random walk configuration (usually, walk length 𝐿 = 80 and 𝑟 = 10 walks per node), DistGER measures the effectiveness of the information gathered during a walk based on entropy, and decides the optimal walk length and number of walks per node. Thanks to these information-centric random walks, DistGER's embeddings are also more effective when applied to downstream tasks.
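To give a rough feel for what "information-centric" means, the sketch below grows a walk and stops once one more step adds almost no Shannon entropy to the distribution of visited nodes. The threshold eps, the exact stopping rule, and the function names are illustrative assumptions; HuGE and DistGER use more refined entropy-based criteria for both walk length and number of walks.

```python
import math
import random
from collections import Counter

def walk_entropy(counts, total):
    """Shannon entropy of the visited-node distribution of a walk."""
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def information_centric_walk(adj, start, max_len=80, eps=1e-3, rng=None):
    """Extend a walk while it keeps gaining information; stop once the
    marginal entropy gain of an extra step is negligible (illustrative rule)."""
    rng = rng or random.Random(0)
    walk, counts = [start], Counter([start])
    prev_h = 0.0
    while len(walk) < max_len and adj[walk[-1]]:
        nxt = rng.choice(adj[walk[-1]])
        walk.append(nxt)
        counts[nxt] += 1
        h = walk_entropy(counts, len(walk))   # recomputed from scratch: O(L) per step
        if abs(h - prev_h) < eps:             # marginal information gain is negligible
            break
        prev_h = h
    return walk
```

Note that this naive version rescans the whole visit-count table at every step, which is exactly the overhead the first contribution below removes.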


Three novel contributions of DistGER are as follows.

First, since the information-centric random walk requires measuring the effectiveness of the generated walk on-the-fly during the walking procedure, it inevitably introduces higher computation and communication costs in a distributed setting. DistGER resolves this by showing that the effectiveness of a walking path can be measured through incremental information, avoiding the need for full-path information. DistGER introduces incremental information-centric computing (InCoM), ensuring O(1) time for on-the-fly measurement and constant-size messages across computing machines.
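To see why an incremental formulation avoids rescanning the full path, note that the entropy of the visit distribution can be rewritten as H = ln L − (1/L) Σ_v c_v ln c_v, where L is the current walk length and c_v the visit count of node v; only one count changes per step. The sketch below maintains that sum in O(1) per step. It is a plausible instantiation of the incremental idea under these assumptions, not the exact quantities DistGER tracks.

```python
import math

class IncrementalEntropy:
    """O(1)-per-step entropy of the visited-node distribution of a growing walk,
    using H = ln L - (1/L) * sum_v c_v * ln c_v (illustrative sketch of InCoM)."""
    def __init__(self):
        self.length = 0          # L: number of visits so far
        self.counts = {}         # c_v: visits per node
        self.weighted = 0.0      # S = sum_v c_v * ln(c_v)

    def add(self, node):
        c = self.counts.get(node, 0)
        if c > 0:
            self.weighted -= c * math.log(c)          # remove the old contribution
        self.weighted += (c + 1) * math.log(c + 1)    # add the updated contribution
        self.counts[node] = c + 1
        self.length += 1

    def entropy(self):
        if self.length == 0:
            return 0.0
        return math.log(self.length) - self.weighted / self.length
```

Because each step touches only one counter and one running sum, a walker forwarded to another machine needs to carry just these constant-size statistics rather than the full path.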

Second, considering the randomness inherent in random walks and the workload-balancing requirement, DistGER proposes multi-proximity-aware, streaming, parallel graph partitioning (MPGP) that adapts to random walk characteristics, increasing local partition utilization. Meanwhile, it uses a dynamic workload constraint in the partitioning strategy to ensure load balancing.
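The following is a simplified LDG-style streaming assignment used only as a stand-in to illustrate the flavor of MPGP: each arriving node goes to the partition where it has the most already-placed neighbors, discounted by how full that partition is. MPGP additionally scores multi-hop proximity relevant to random walks and adapts its workload constraint dynamically; the slack factor and helper names here are assumptions.

```python
def stream_partition(nodes, adj, num_parts, capacity_slack=1.1):
    """Streaming node-to-partition assignment with a proximity score and a
    soft load constraint (simplified illustration, not the MPGP algorithm)."""
    parts = [set() for _ in range(num_parts)]
    assignment = {}
    capacity = capacity_slack * len(nodes) / num_parts   # soft per-partition limit

    for v in nodes:                                      # nodes arrive in streaming order
        def score(p):
            proximity = sum(1 for u in adj.get(v, []) if u in parts[p])
            balance = 1.0 - len(parts[p]) / capacity     # penalize nearly full partitions
            return proximity * balance
        # ties go to the least loaded partition to keep the workload balanced
        best = max(range(num_parts), key=lambda p: (score(p), -len(parts[p])))
        parts[best].add(v)
        assignment[v] = best
    return assignment
```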

Finally, different from existing random walk-based embedding techniques, DistGER designs a distributed Skip-Gram learning model (DSGL) to generate node embeddings and implements an end-to-end distributed graph embedding system. Specifically, DSGL leverages global matrices and two local buffers for node vectors to improve access locality, reducing cache-line ping-ponging across multiple cores during model updates; it then develops multi-window, shared-negative-sample computation to fully exploit CPU throughput. Moreover, a hotness-block-based synchronization mechanism is proposed to synchronize node vectors efficiently in the distributed setting.
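To illustrate just the shared-negative-sample idea (single-threaded, and only this one piece of DSGL), the numpy sketch below applies Skip-Gram-with-negative-sampling updates in which several (center, context) pairs drawn from different context windows reuse the same negative samples instead of drawing fresh ones per pair. The matrix shapes, learning rate, and toy pairs are assumptions; DSGL's locality buffers and hotness-block synchronization are not shown.

```python
import numpy as np

def sgns_update_shared_negatives(W_in, W_out, pairs, negatives, lr=0.025):
    """One SGD pass of Skip-Gram with negative sampling where all pairs share
    one set of negative samples (illustrative sketch, not the DSGL kernel)."""
    for center, context in pairs:
        v_c = W_in[center]                         # input (center) vector
        grad_c = np.zeros_like(v_c)
        # one positive target plus the shared negative targets
        for target, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
            v_t = W_out[target]                    # output (target) vector
            score = 1.0 / (1.0 + np.exp(-np.dot(v_c, v_t)))   # sigmoid of the dot product
            g = lr * (label - score)
            grad_c += g * v_t
            W_out[target] += g * v_c               # update the target's output vector
        W_in[center] += grad_c                     # apply the accumulated center gradient

# toy usage: 10 nodes, 16-dimensional vectors, 2 pairs sharing 3 negatives
rng = np.random.default_rng(0)
W_in = (rng.random((10, 16)) - 0.5) / 16
W_out = np.zeros((10, 16))
sgns_update_shared_negatives(W_in, W_out, pairs=[(0, 1), (2, 3)], negatives=[4, 5, 6])
```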

We conduct extensive experiments on five large, real-world graphs to demonstrate that DistGER achieves much better efficiency, scalability, and effectiveness than existing popular distributed frameworks, e.g., KnightKing [31], DistDGL [26], and Pytorch-BigGraph [32] (Figures 2 and 3). As a preview, compared to KnightKing, Pytorch-BigGraph, and DistDGL, DistGER achieves 9.3×, 26.2×, and 51.9× faster embedding on average, and easily scales to billion-edge graphs. In addition, DistGER generalizes well to other random walk-based graph embedding methods.


Figure 2: Efficiency (a) and Scalability (b): PBG, DistDGL, KnightKing, HuGE-D (baseline), DistGER (ours) 



Figure 3: (a) Scalability of DistGER on synthetic graphs, where the Y-axis is in log scale. The running times for six real-world graphs are also inserted into the plot and are consistent with the trend on synthetic data. (b) The influence of running time on embedding quality for DistGER and competitors.

For more information, click here for our paper.

References:

[1] X. Wei, L. Xu, B. Cao, and P. S. Yu. 2017. Cross View Link Prediction by Learning Noise-resilient Representation Consensus. In WWW.

[2] S. Bhagat, G. Cormode, and S. Muthukrishnan. 2011. Node Classification in Social Networks. In Social Network Data Analytics, Charu C. Aggarwal (Ed.). Springer, 115–148.

[3] F. Nie, W. Zhu, and X. Li. 2017. Unsupervised Large Graph Embedding. In AAAI.

[4] C. Shi, B. Hu, W. X. Zhao, and P. S. Yu. 2019. Heterogeneous Information Network Embedding for Recommendation. IEEE Trans. Knowl. Data Eng. 31, 2 (2019), 357–370.

[5] H. Cai, V. W. Zheng, and K. C.-C. Chang. 2018. A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications. IEEE Trans. Knowl. Data Eng. 30, 9 (2018), 1616–1637.

[6] J. Qiu, Y. Dong, H. Ma, J. Li, C. Wang, K. Wang, and J. Tang. 2019. NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization. In WWW.

[7] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang. 2018. Network Embedding as Matrix Factorization: Unifying Deep Walk, LINE, PTE, and node2vec. In WSDM.

[8] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and S. Yang. 2017. Community Preserving Network Embedding. In AAAI.

[9] R. Yang, J. Shi, X. Xiao, Y. Yang, and S. S. Bhowmick. 2020. Homogeneous Network Embedding for Massive Graphs via Reweighted Personalized PageRank. Proc. VLDB Endow. 13, 5 (2020), 670–683.

[10] J. Zhang, Y. Dong, Y. Wang, J. Tang, and M. Ding. 2019. ProNE: Fast and Scalable Network Representation Learning. In IJCAI.

[11] M. Belkin and P. Niyogi. 2001. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In NeurIPS.

[12] W. L. Hamilton, Z. Ying, and J. Leskovec. 2017. Inductive Representation Learning on Large Graphs. In NeurIPS.

[13] K. Tu, P. Cui, X. Wang, P. S. Yu, and W. Zhu. 2018. Deep Recursive Network Embedding with Regular Equivalence. In KDD.

[14] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. 2018. Graph Attention Networks. In ICLR.

[15] D. Wang, P. Cui, and W. Zhu. 2016. Structural Deep Network Embedding. In KDD.

[16] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang, F. Zhang, X. Xie, and M. Guo. 2018. GraphGAN: Graph Representation Learning With Generative Adversarial Nets. In AAAI.

[17] P. Fang, F. Wang, Z. Shi, H. Jiang, D. Feng, and L. Yang. 2021. HuGE: An Entropy-driven Approach to Efficient and Scalable Graph Embeddings. In ICDE.

[18] A. Grover and J. Leskovec. 2016. Node2vec: Scalable Feature Learning for Networks. In KDD.

[19] B. Perozzi, R. Al-Rfou, and S. Skiena. 2014. DeepWalk: Online Learning of Social Representations. In KDD.

[20] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. 2015. LINE: Large-scale Information Network Embedding. In WWW.

[21] A. Tsitsulin, D. Mottin, P. Karras, and E. Müller. 2018. VERSE: Versatile Graph Embeddings from Similarity Measures. In WWW.

[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NeurIPS.

[23] K. Yang, X. Ma, S. Thirumuruganathan, K. Chen, and Y. Wu. 2021. Random Walks on Huge Graphs at Cache Efficiency. In SOSP.

[24] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. 2013. WTF: The Who to Follow Service at Twitter. In WWW.

[25] J. Wang, P. Huang, H. Zhao, Z. Zhang, B. Zhao, and D. L. Lee. 2018. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba. In SIGKDD.

[26] D. Zheng, C. Ma, M. Wang, J. Zhou, Q. Su, X. Song, Q. Gan, Z. Zhang, and G. Karypis. 2020. DistDGL: Distributed Graph Neural Network Training for Billion-Scale Graphs. In IA3@SC.

[27] W. Wei, Y. Wang, P. Gao, S. Sun, and D. Yu. 2020. A Distributed Multi-GPU System for Large-Scale Node Embedding at Tencent. CoRR abs/2005.13789 (2020).

[28] Z. Zhu, S. Xu, J. Tang, and M. Qu. 2019. GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding. In WWW.

[29] M. Serafini. 2021. Scalable Graph Neural Network Training: The Case for Sampling. ACM SIGOPS Oper. Syst. Rev. 55, 1 (2021), 68–76.

[30] C. Zheng, H. Chen, Y. Cheng, Z. Song, Y. Wu, C. Li, J. Cheng, H. Yang, and S. Zhang. 2022. ByteGNN: Efficient Graph Neural Network Training at Large Scale. Proc. VLDB Endow. 15, 6 (2022), 1228–1242.

[31] K. Yang, M. Zhang, K. Chen, X. Ma, Y. Bai, and Y. Jiang. 2019. KnightKing: A Fast Distributed Graph Random Walk Engine. In SOSP.

[32] A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt, A. Bose, and A. Peysakhovich. 2019. Pytorch-BigGraph: A Large Scale Graph Embedding System. In MLSys.
