Semantic Guided and Response Times Bounded Top-k Similarity Search over Knowledge Graphs

Semantic Guided and Response Times Bounded Top-k Similarity Search over Knowledge Graphs

A summary of the IEEE ICDE 2020 research paper by Yuxiang Wang, Arijit Khan, Tianxing Wu, Jiahui Jin, and Haijiang Yan.

Background: Knowledge graphs (such as DBpedia [1], Yago [2], and Freebase [3]) have been constructed in recent years, managing large-scale and real-world facts as a graph [4]. In such graphs, each node represents an entity with attributes, and each edge denotes a relationship between two entities. 

Answering Graph Query on Knowledge Graphs: Querying knowledge graphs is essential for a wide range of applications, e.g., question answering and semantic search [5]. For example, consider that a user wants to find all cars produced in Germany. One can come up with a reasonable graph representation of this query as a query graph GQ, and identify the exact or approximate matches of GQ in a knowledge graph G using graph query models [6]–[10]. Correct answers can be returned, such as <BMW 320, assembly, Germany>. Graph query also acts as a fundamental component for other query forms, such as keyword and natural language query [9]. We can reduce these query forms to a graph query by translating input text to a query graph [11], [12].

To retrieve the information of interest from a knowledge graph G, users are often required to have full knowledge of the vocabulary used in G [13], as well as the underlying schemas defined in G, which is difficult for ordinary users (even professional users) [14]. Otherwise, the user-built query graph is likely to be structurally different from the predefined schemas, thus fails to return correct answers due to the mismatch with the query graph. Consider the following motivating example.

Consider the query: Find all cars that are produced in Germany (Q117 from QALD-4 benchmark [15]). Figure 1 provides four correct graph matches with different schemas in DBpedia. Each one is represented as an n-hop path. An ordinary user may build a query graph G1Q based on the phrases from the natural language question [12]. While a professional user may build G2Q by using the controlled vocabulary (e.g., Automobile is used to represent vehicles in DBpedia) and a schema she already knew. However, both query graphs suffer from the structural mismatch problem.

Fig. 1: An example of structural mismatch with query graphs: different users may employ different query graphs (top) to find all cars made in Germany. Four correct graph matches in DBpedia are provided (bottom). Only G2Q can retrieve partial correct answers having the same schemas as 1 , because the 1-hop edge assembly cannot match any n-hop (n > 1) paths.

Challenges: 

Mismatch in query nodes. In G1Q, a query node with type Car represents the phrase “cars”. However, no entity in DBpedia has the same, or even a textually similar type for Car, because it is not a term belonging to the controlled vocabulary of DBpedia. Hence, G1Q fails to find correct answers.

Mismatch in query edges. For G2Q, the user can retrieve 234 answers that have the same schema as graph match 1. However, more than 200 correct answers are ignored, because a 1-hop edge in G2Q cannot be mapped to the semantically similar n-hop (n > 1) paths (edge-to-path mapping).

If a user has full knowledge about DBpedia, then she can build various query graphs that cover all possible schemas, to obtain all cars of interest. Generally, it is a strong assumption. As an alternative, we aim to provide a graph query system that will be able to support different query graphs without forcing users to use very controlled vocabulary or be knowledgeable about the dataset. This motivates us to fill the structural gap in graph matching by considering the semantics of query graphs. 

Our Solution for Semantic-Guided Search: In this work, we develop a semantic-guided search to find the semantically similar paths to query edges without prior knowledge. To the best of our knowledge, we are the first to support semantically edge-to-path mapping in graph query. Moreover, we handle  query node mismatch via a node transformation library. Figure 2 shows the pipeline of our approach, which has an offline and an online phase.

Fig. 2: A running example of our approach, including offline KG embedding (bottom right) and online query processing. Four components of the online part are: (1) Query graph decomposition, (2) Semantic graph construction, (3) Semantic-guided search, and (4) Threshold Algorithm-based assembly. An approximate optimization is applied for response-time-bounded query. All predicates in the example knowledge graph are provided in a table (bottom middle).

Offline phase. Given a knowledge graph G, we leverage a knowledge graph embedding model [4], [18] to represent the predicates of G in a vector space E. Hence, the semantic similarities of predicates can be easily obtained through vector calculation, and this makes it possible to achieve the semantic similarities of a path in G to a query edge in GQ. To be precise, this is essential for semantically edge-to-path mapping.

Online phase. In Figure 2, we take a query graph GQ as an input. We adopt a decomposition-assembly framework for GQ, which consists of four basic components.

(1) Query graph decomposition. We first decompose GQ into a set of sub-query graphs {g1...gn} by a dynamic programming algorithm, subject to minimizing the possible query cost. 

(2) Semantic graph construction. To support the semantically edge-to-path mapping, we then construct a semantic graph SGQ for each sub-query graph gi by preserving the predicate semantic similarities on the edges of G. In Figure 2, we show example semantic graphs for g1 and g2. For instance, each blue edge has a high semantic similarity to the query edge product, e.g., e2 (assembly) has 0.98 semantic similarity to product. By utilizing these similarities in SGQ, we define the path semantic similarity (pss) to measure how semantically similar a path in G is to gi.

(3) Semantic-guided search. We next present an A* semantic search to find the top-k semantically similar matches for each sub-query graph gi from the semantic graph SGQ based on the path semantic similarity (pss). To improve the efficiency, we propose a well-designed heuristic estimation function of pss that prunes the search space. We prove the effectiveness guarantee of our A* semantic search, that is, the matches with the greatest pss must be found.

(4) Threshold Algorithm (TA)-based assembly. Finally, we assemble the matches of all sub-query graphs based on the threshold algorithm (TA) [17], in order to form the final answers for GQ.

Response-Time Bounded Search: Another crucial problem involves improving the system response time (SRT) for a graph query. SRT is the amount of time that a user waits before viewing results [16]. A shorter SRT usually indicates a better user experience. To the best of our knowledge, no current state-of-the-art work supports response-time-bounded graph query. In this work, we present an approximate optimization on our semantic-guided search to enable a trade-off between accuracy and the system response time (SRT) within a user-specified time bound T. As more time is given, more high-quality matches are refined incrementally. We prove that the globally optimal matches for GQ can be achieved theoretically when sufficient time is given.

In summary, we blend semantic-guided and response-time bounded characteristics in one system to support the top-k graph query over a knowledge graph effectively and efficiently. We have evaluated the effectiveness and efficiency of our approach through extensive experiments on three real-world and large-scale knowledge graphs.

For more information, click here for our paper.  Codes and Datasets: GitHub.
Blog post contributed by: Arijit Khan
  
[Reference]

[1] P. N. Mendes, M. Jakob, and C. Bizer, “Dbpedia: A multilingual crossdomain knowledge base.” in LREC, 2012, pp. 1813–1817. 

[2] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum, “Yago2: A spatially and temporally enhanced knowledge base from wikipedia,” Artificial Intelligence, vol. 194, pp. 28–61, 2013. 

[3] K. Bollacker, R. Cook, and P. Tufts, “Freebase: A shared database of structured general human knowledge,” in AAAI, 2007, pp. 1962–1963. 

[4] X. Huang, J. Zhang, D. Li, and P. Li, “Knowledge graph embedding based question answering,” in WSDM, 2019, pp. 105–113. 

[5] R. V. Guha, R. McCool, and E. Miller, “Semantic search,” in WWW, 2003, pp. 700–709. 

[6] A. Khan, Y. Wu, C. C. Aggarwal, and X. Yan, “Nema: Fast graph search with label similarity,” PVLDB, vol. 6, no. 3, pp. 181–192, 2013. 

[7] L. Zou, R. Huang, H. Wang, J. X. Yu, W. He, and D. Zhao, “Natural language question answering over rdf: a graph data driven approach,” in SIGMOD, 2014, pp. 313–324. 

[8] S. Yang, Y. Wu, H. Sun, and X. Yan, “Schemaless and structureless graph querying,” PVLDB, vol. 7, no. 7, pp. 565–576, 2014. 

[9] S. Yang, F. Han, Y. Wu, and X. Yan, “Fast top-k search in knowledge graphs,” in ICDE, 2016, pp. 990–1001. 

[10] J. Jin, S. Khemarat, and L. Gao, “Querying web scale information networks via bounding matching scores,” in WWW, 2015, pp. 527–537. 

[11] W. Zheng, L. Zou, X. Lian, J. X. Yu, S. Song, and D. Zhao, “How to build templates for rdf question/answering: An uncertain graph similarity join approach,” in SIGMOD, 2015, pp. 1809–1824. 

[12] S. Han, L. Zou, J. X. Yu, and D. Zhao, “Keyword search on rdf graphs-a query graph assembly approach,” in CIKM, 2017, pp. 227–236. 

[13] S. Shekarpour, E. Marx, S. Auer, and A. Sheth, “Rquery: rewriting natural language queries on knowledge graphs to alleviate the vocabulary mismatch problem,” in AAAI, 2017, pp. 3936–3943.

[14] W. Zheng, L. Zou, W. Peng, X. Yan, S. Song, and D. Zhao, “Semantic sparql similarity search over rdf knowledge graphs,” PVLDB, vol. 9, no. 11, pp. 840–851, 2016. 

[15] “Qald-4,” http://qald.aksw.org/index.php?x=challenge&q=4. 

[16] S. S. Bhowmick, B. Choi, and S. Zhou, “Vogue: Towards a visual interaction-aware graph query processing framework,” in CIDR, 2013.

[17] R. Fagin, A. Lotem, and M. Naor, “Optimal aggregation algorithms for middleware,” in PODS, 2001.

[18] A. Bordes, N. Usunier, A. Garc´ıa-Duran, J. Weston, and O. Yakhnenko, “Translating embeddings for modeling multi-relational data,” in NIPS, 2013, pp. 2787–2795.

Comments

Popular posts from this blog

Measurements, Analyses, and Insights on the Entire Ethereum Blockchain Network