Distributed Graph Embedding with Information-Oriented Random Walks
A summary of the PVLDB 2023 research paper by
Peng Fang, Arijit Khan, Siqiang Luo, Fang Wang, Dan Feng, Zhenli Li, Wei Yin,
and Yuchao Cao
Background [Graph Embedding]: Graph embedding maps graph nodes
to low-dimensional vectors and is widely adopted in machine learning tasks such
as link prediction [1], node classification [2], clustering [3], and recommendation
[4]. Sequential graph embedding techniques [5] fall into three categories. (1) Matrix
factorization-based algorithms [6, 7, 8, 9, 10] construct feature representations
based on the adjacency or Laplacian matrix and involve spectral techniques [11].
(2) Graph neural network (GNN)-based approaches [12, 13, 14, 15, 16] focus on
generalizing graph spectra into semi-supervised or supervised graph learning. Both
techniques incur high computational overhead and DRAM dependencies, limiting their
scalability to large graphs. (3) A plethora of random walk-based graph embedding
solutions [17, 18, 19, 20, 21] have also been proposed. A random walk is a graph traversal
that starts from a source node, jumps to a neighboring node at each step, and stops
after a few steps. Random walk-based embeddings are inspired by the well-known natural
language processing model, word2vec [22]. By conducting sufficient random walks
on graphs, substantial graph structural information is collected and fed into the word2vec (Skip-Gram) model to generate node embeddings. Compared with other graph embedding
solutions such as graph neural networks and matrix factorization techniques, random
walk-based methods are more flexible, parallel-friendly, and scale to larger graphs
[23].
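To make this two-stage pipeline concrete, the following is a minimal sketch (ours, not the authors' code) of sampling truncated random walks and feeding the resulting corpus to a Skip-Gram trainer. The toy adjacency list and the helper sample_walks are illustrative, and gensim's Word2Vec is used as a stand-in Skip-Gram implementation.

```python
import random
from gensim.models import Word2Vec  # off-the-shelf Skip-Gram trainer used as a stand-in

def sample_walks(adj, num_walks=10, walk_length=80, seed=42):
    """Sample truncated random walks; adj maps node -> list of neighbor nodes."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            while len(walk) < walk_length:
                nbrs = adj[walk[-1]]
                if not nbrs:
                    break
                walk.append(rng.choice(nbrs))
            walks.append([str(v) for v in walk])  # word2vec expects string tokens
    return walks

# Toy graph given as adjacency lists.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
walks = sample_walks(adj)

# Train Skip-Gram (sg=1) with negative sampling on the walk corpus.
model = Word2Vec(walks, vector_size=64, window=5, sg=1, negative=5, min_count=1)
print(model.wv["0"][:5])  # first few dimensions of node 0's embedding
```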
Motivation [Scalable Graph
Embedding]: The increasing
availability of billion-edge graphs underscores the importance of learning efficient
and effective embeddings on large graphs. For instance, the Twitter graph includes
over 41 million user nodes and over one billion edges, and it has extensive
requirements for link prediction and classification tasks [24]. The graph of users
and products at Alibaba also consists of more than two billion user-product edges,
which forms a giant bipartite graph for its recommendation tasks [25]. The inherent
challenge is that the number of random walks required increases with the graph size.
For example, one representative method, node2vec [18], needs to sample many node pairs to ensure embedding quality; it takes months to learn node embeddings for a graph with 100 million nodes and 500 million edges using 20 threads on a modern server [10]. Some very recent work, e.g., HuGE [17], attempts to improve the quality of random walks according to the importance of nodes. Although this method removes redundant random walks to a great extent, the inherent complexity remains similar: it still requires more than one week to learn embeddings for a billion-edge Twitter graph on a modern server, hindering its adoption in real-world applications.
Another line of work turns to GPUs for efficient graph embedding. For example, some recent graph embedding frameworks (e.g., [27, 28]) simultaneously perform graph random walks on CPUs and embedding training on GPUs. However, since the computing power of GPUs and CPUs differs widely, the random walk procedure on the CPUs typically cannot keep up with the embedding computation on the GPUs, creating bottlenecks [29, 30]. Furthermore, this approach depends heavily on the GPUs' compute and memory capacity, which can differ drastically across servers.
Recently, distributed graph embedding, or computing
graph embeddings with multiple machines, has attracted significant research
interest to address the scalability issue. Examples include KnightKing [31], Pytorch-BigGraph
[32], and DistDGL [26]. KnightKing [31] optimizes the walk-forwarding process for
node2vec and brings several orders of magnitude of improvement over a single-server solution. However, it may suffer from redundant or insufficient random walks caused by its routine random walk setting (usually, walk length 𝐿 = 80 and 𝑟 = 10 walks per node), resulting in low-quality training information for downstream tasks [17]. Moreover, the workload-balancing graph partitioning scheme that it leverages fails to consider the randomness inherent in random walks, introducing higher communication costs across machines and degrading its performance. Facebook's Pytorch-BigGraph [32] leverages a graph partitioning technique and a parameter server to learn embeddings of large graphs on multiple CPUs on top of PyTorch. However, the parameter server used in this framework needs to synchronize embeddings with clients, which puts more load on the communication network and limits scalability. Amazon has recently released DistDGL [26], a distributed graph embedding framework for graph neural network models. However, its efficiency is bogged down by the graph sampling operation: for instance, more than 80% of the overhead goes to sampling in the GraphSAGE model [12], and the mini-batch sampling used may delay gradient updates, causing inefficient synchronization. In conclusion, although distributed computation frameworks have shown better performance than single-server and CPU-GPU-based solutions, significant room exists for further improvement.
Our Proposed DistGER System: We present a newly designed distributed graph embedding system, DistGER (Figure 1), which incorporates more effective information-centric random walks such as HuGE [17] and achieves super-fast graph embedding compared to the state of the art. Unlike routine random walk configurations (usually, walk length 𝐿 = 80 and 𝑟 = 10 walks per node), DistGER measures the effectiveness of the information gathered during a walk based on entropy and decides the optimal walk length and number of walks per node. Thanks to such information-centric random walks, DistGER's embeddings also show higher effectiveness when applied to downstream tasks.
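As a rough illustration of this idea (a simplified sketch of entropy-based effectiveness, not HuGE's or DistGER's exact criterion), a walk can be extended only while the empirical entropy of its visited-node distribution keeps growing by more than a small threshold; the threshold eps below is an assumed parameter.

```python
import math
import random
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of the empirical distribution in a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_centric_walk(adj, start, max_length=100, eps=1e-3, seed=0):
    """Extend the walk while each step still adds information, i.e., while the
    entropy of the visited-node distribution grows by more than eps."""
    rng = random.Random(seed)
    walk, visits = [start], Counter([start])
    prev_h = 0.0
    while len(walk) < max_length:
        nbrs = adj[walk[-1]]
        if not nbrs:
            break
        nxt = rng.choice(nbrs)
        walk.append(nxt)
        visits[nxt] += 1
        h = entropy(visits)
        if h - prev_h < eps:  # information gain has saturated: stop early
            break
        prev_h = h
    return walk
```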
Three novel contributions of DistGER are as follows.
First, since the information-centric random walk
requires measuring the effectiveness of the generated walk on-the-fly during the
walking procedure, it inevitably introduces higher computation and
communication costs in a distributed setting. DistGER resolves this by showing that
the effectiveness of a walking path can be measured through incremental information,
avoiding the need for full-path information. DistGER invents incremental information-centric
computing (InCoM), which ensures O(1) time for on-the-fly measurements and maintains constant-size messages across computing machines.
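The key observation can be illustrated with a small example of our own (not the InCoM implementation): the entropy of the visit-count distribution, H = log2(N) - (1/N) * sum_i c_i log2(c_i), can be maintained with constant work per step by tracking N and a running sum, so the full path never needs to be re-scanned and only scalar aggregates need to travel between machines.

```python
import math
from collections import defaultdict

class IncrementalEntropy:
    """Maintain the entropy of walk visit counts with O(1) work per step,
    using H = log2(N) - (1/N) * sum_i c_i * log2(c_i)."""

    def __init__(self):
        self.counts = defaultdict(int)  # c_i: visits per node
        self.total = 0                  # N: total visits so far
        self.weighted = 0.0             # running sum_i c_i * log2(c_i)

    def add(self, node):
        """Record one more visit to `node`, updating the aggregates in O(1)."""
        c = self.counts[node]
        if c > 0:
            self.weighted -= c * math.log2(c)
        self.counts[node] = c + 1
        self.weighted += (c + 1) * math.log2(c + 1)
        self.total += 1

    def value(self):
        """Current entropy (bits) of the visited-node distribution."""
        if self.total == 0:
            return 0.0
        return math.log2(self.total) - self.weighted / self.total
```

Stepping the walk then reduces to calling add(next_node) and comparing value() against the previous entropy, as in the earlier sketch but without touching the path history.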
Second, considering the randomness inherent in random
walks and the workload balancing requirement, DistGER proposes multi-proximity aware,
streaming, parallel graph partitioning (MPGP) that is adaptive to random walk characteristics,
increasing local partition utilization. Meanwhile, it uses a dynamic workload constraint in its partitioning strategy to ensure load balancing.
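As a simplified illustration (our own sketch, not MPGP itself, which scores partitions using multiple proximities rather than only direct neighbors), a greedy streaming partitioner assigns each arriving node to the partition that already holds most of its neighbors, subject to a dynamic load cap.

```python
def stream_partition(node_stream, adj, num_parts, capacity_slack=1.1):
    """Greedy streaming partitioning sketch: place each node where it has the
    most already-assigned neighbors, skipping partitions above a load cap."""
    assignment = {}                               # node -> partition id
    loads = [0] * num_parts                       # current size of each partition
    cap = capacity_slack * len(adj) / num_parts   # dynamic workload constraint
    for v in node_stream:
        scores = [0] * num_parts
        for u in adj[v]:
            if u in assignment:
                scores[assignment[u]] += 1        # proximity of v to that partition
        # pick the best-scoring partition that still has room; break ties by load
        candidates = [p for p in range(num_parts) if loads[p] < cap]
        best = max(candidates, key=lambda p: (scores[p], -loads[p]))
        assignment[v] = best
        loads[best] += 1
    return assignment
```

Higher local partition utilization lets a walker take more consecutive steps without crossing machine boundaries, which is what reduces the cross-machine communication cost of random walks.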
Finally, different from the existing random walk-based
embedding techniques, DistGER designs a distributed Skip-Gram learning model (DSGL)
to generate node embeddings and implements an end-to-end distributed graph embedding
system. Precisely, DSGL leverages global matrices and two local buffers for node vectors to improve access locality, thus reducing cache-line ping-ponging across multiple cores during model updates; it then develops multi-window shared negative sampling to fully exploit CPU throughput. Moreover, a hotness block-based synchronization mechanism is proposed to synchronize node vectors efficiently in a distributed setting.
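The shared-negatives idea can be sketched as follows (a rough single-machine NumPy illustration of Skip-Gram with negative sampling, not DSGL's actual multi-core or distributed logic): one batch of negative samples is drawn per walk and reused for every (center, context) pair, so the corresponding rows of the output matrix stay hot in cache.

```python
import numpy as np

def sgns_update_shared_negatives(walk, W_in, W_out, window=5, num_neg=5,
                                 lr=0.025, rng=None):
    """One SGNS pass over a walk, reusing a single batch of negative samples
    for all (center, context) pairs -- a sketch of the shared-negatives idea."""
    rng = rng or np.random.default_rng(0)
    negatives = rng.integers(0, W_out.shape[0], size=num_neg)   # shared per walk
    for i, center in enumerate(walk):
        lo, hi = max(0, i - window), min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            targets = np.concatenate(([walk[j]], negatives))    # 1 positive + negatives
            labels = np.array([1.0] + [0.0] * num_neg)
            v_in = W_in[center].copy()                 # center (input) vector
            v_out = W_out[targets]                     # positive + negative rows
            scores = 1.0 / (1.0 + np.exp(-v_out @ v_in))
            grad = lr * (labels - scores)              # per-target error signal
            W_in[center] += grad @ v_out               # update center vector
            W_out[targets] += np.outer(grad, v_in)     # update target vectors
    return W_in, W_out
```

In DSGL, such updates additionally go through per-core local buffers and a hotness block-based synchronization of node vectors across machines, which this sketch omits.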
We conduct extensive experiments on five large,
real-world graphs to demonstrate that DistGER achieves much better efficiency,
scalability, and effectiveness than existing popular distributed frameworks, e.g.,
KnightKing [31], DistDGL [26], and Pytorch-BigGraph [32] (Figures 2, 3). As a preview, compared
to KnightKing, Pytorch-BigGraph, and DistDGL, our DistGER achieves 9.3×, 26.2×,
and 51.9× faster embedding on average, and easily scales to billion-edge graphs.
In addition, DistGER generalizes well to other random walk-based graph embedding
methods.
Figure 2: (a) Efficiency and (b) scalability of PBG, DistDGL, KnightKing, HuGE-D (baseline), and DistGER (ours).
Figure 3: (a) Scalability of DistGER on synthetic graphs, where the Y-axis is in log scale. The running times for six real-world graphs are also shown in the plot and are consistent with the trend on synthetic data. (b) The influence of running time on embedding quality for DistGER and competitors.
For more information, click here for our paper.
References:
[1] X. Wei, L. Xu, B. Cao, and P. S. Yu. 2017. Cross
View Link Prediction by Learning Noise-resilient Representation Consensus. In WWW.
[2] S. Bhagat, G. Cormode, and S. Muthukrishnan.
2011. Node Classification in Social Networks. In Social Network Data Analytics,
Charu C. Aggarwal (Ed.). Springer, 115–148.
[3] F. Nie, W. Zhu, and X. Li. 2017. Unsupervised
Large Graph Embedding. In AAAI.
[4] C. Shi, B. Hu, W. X. Zhao, and P. S. Yu. 2019.
Heterogeneous Information Network Embedding for Recommendation. IEEE Trans. Knowl.
Data Eng. 31, 2 (2019), 357–370.
[5] H. Cai, V. W. Zheng, and K. C.-C. Chang. 2018.
A Comprehensive Survey of Graph Embedding: Problems, Techniques, and Applications.
IEEE Trans. Knowl. Data Eng. 30, 9 (2018), 1616–1637.
[6] J. Qiu, Y. Dong, H. Ma, J. Li, C. Wang, K. Wang,
and J. Tang. 2019. NetSMF: Large-Scale Network Embedding as Sparse Matrix Factorization.
In WWW.
[7] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and
J. Tang. 2018. Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE,
PTE, and node2vec. In WSDM.
[8] X. Wang, P. Cui, J. Wang, J. Pei, W. Zhu, and
S. Yang. 2017. Community Preserving Network Embedding. In AAAI.
[9] R. Yang, J. Shi, X. Xiao, Y. Yang, and S. S.
Bhowmick. 2020. Homogeneous Network Embedding for Massive Graphs via Reweighted
Personalized PageRank. Proc. VLDB Endow. 13, 5 (2020), 670–683.
[10] J. Zhang, Y. Dong, Y. Wang, J. Tang, and M.
Ding. 2019. ProNE: Fast and Scalable Network Representation Learning. In IJCAI.
[11] M. Belkin and P. Niyogi. 2001. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. In NeurIPS.
[12] W. L. Hamilton, Z. Ying, and J. Leskovec. 2017.
Inductive Representation Learning on Large Graphs. In NeurIPS.
[13] K. Tu, P. Cui, X. Wang, P. S. Yu, and W. Zhu.
2018. Deep Recursive Network Embedding with Regular Equivalence. In KDD.
[14] P. Velickovic, G. Cucurull, A. Casanova, A.
Romero, P. Liò, and Y. Bengio. 2018. Graph Attention Networks. In ICLR.
[15] D. Wang, P. Cui, and W. Zhu. 2016. Structural
Deep Network Embedding. In KDD.
[16] H. Wang, J. Wang, J. Wang, M. Zhao, W. Zhang,
F. Zhang, X. Xie, and M. Guo. 2018. GraphGAN: Graph Representation Learning With
Generative Adversarial Nets. In AAAI.
[17] P. Fang, F. Wang, Z. Shi, H. Jiang, D. Feng,
and L. Yang. 2021. HuGE: An Entropy-driven Approach to Efficient and Scalable Graph
Embeddings. In ICDE.
[18] A. Grover and J. Leskovec. 2016. Node2vec:
Scalable Feature Learning for Networks. In KDD.
[19] B. Perozzi, R. Al-Rfou, and S. Skiena. 2014.
DeepWalk: Online Learning of Social Representations. In KDD.
[20] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan,
and Q. Mei. 2015. LINE: Large-scale Information Network Embedding. In WWW.
[21] A. Tsitsulin, D. Mottin, P. Karras, and E.
Müller. 2018. VERSE: Versatile Graph Embeddings from Similarity Measures. In WWW.
[22] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado,
and J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality.
In NeurIPS.
[23] K. Yang, X. Ma, S. Thirumuruganathan, K. Chen,
and Y. Wu. 2021. Random Walks on Huge Graphs at Cache Efficiency. In SOSP.
[24] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang,
and R. Zadeh. 2013. WTF: The Who to Follow Service at Twitter. In WWW.
[25] J. Wang, P. Huang, H. Zhao, Z. Zhang, B. Zhao,
and D. L. Lee. 2018. Billion-scale Commodity Embedding for E-commerce Recommendation in Alibaba. In KDD.
[26] D. Zheng, C. Ma, M. Wang, J. Zhou, Q. Su, X.
Song, Q. Gan, Z. Zhang, and G. Karypis. 2020. DistDGL: Distributed Graph Neural
Network Training for Billion-Scale Graphs. In IA3@SC.
[27] W. Wei, Y. Wang, P. Gao, S. Sun, and D. Yu.
2020. A Distributed Multi-GPU System for Large-Scale Node Embedding at Tencent.
CoRR abs/2005.13789 (2020).
[28] Z. Zhu, S. Xu, J. Tang, and M. Qu. 2019. GraphVite:
A High-Performance CPU-GPU Hybrid System for Node Embedding. In WWW.
[29] M. Serafini. 2021. Scalable Graph Neural Network
Training: The Case for Sampling. ACM SIGOPS Oper. Syst. Rev. 55, 1 (2021), 68–76.
[30] C. Zheng, H. Chen, Y. Cheng, Z. Song, Y. Wu,
C. Li, J. Cheng, H. Yang, and S. Zhang. 2022. ByteGNN: Efficient Graph Neural Network
Training at Large Scale. Proc. VLDB Endow. 15, 6 (2022), 1228–1242.
[31] K. Yang, M. Zhang, K. Chen, X. Ma, Y. Bai,
and Y. Jiang. 2019. KnightKing: A Fast Distributed Graph Random Walk Engine. In
SOSP.
[32] A. Lerer, L. Wu, J. Shen, T. Lacroix, L. Wehrstedt,
A. Bose, and A. Peysakhovich. 2019. Pytorch-BigGraph: A Large Scale Graph Embedding
System. In MLSys.