
Abstract New applications continue to emerge as technology develops, and with them the amount of data that takes the form of graphs has grown. Graph data has become important in many fields, including software engineering, bioinformatics, biology, and social networking. Managing this type of data in large databases is now essential, and techniques such as subgraph querying have become central to graph data management. In this project, we study an approximate subgraph matching method called SAPPER, which aims to find all occurrences of a query subgraph in the presence of noise such as missing edges. The method uses pre-generated random spanning trees to find the approximate subgraphs, together with a graph enumeration order that enumerates the candidate subgraphs of the query graph. This project includes an implementation, a discussion of the method and how it works, the implementation issues that we faced, and some evaluation results.

1. INTRODUCTION

Subgraph querying is one of the most fundamental procedures in managing, processing, and analyzing graph data. In some applications, such as biological networks, the graphs are very large, which makes them difficult to process. In this setting, subgraph matching is used to identify the occurrences of a query subgraph in the database graph. Exact subgraph matching assumes that the graph database is clean, so that every result is an exact match of the query graph. SAPPER studies the same subgraph matching problem in the presence of noise: it aims to find all approximate matches of the query graph in a large graph. It builds on the observation that the set of approximate matches of the query graph q is a superset of its exact matches. A subgraph s of the database graph is an approximate match of the query graph q if Dist(s, q) ≤ Θ, where Dist(s, q) is the number of edges that must be changed to transform one graph into the other. In other words, approximate matching can be reduced to exact matching by first finding the subgraphs of the query graph that are within the threshold and then finding the exact matches of each of them.

SAPPER uses random spanning trees and an enumeration order to improve the matching process. A subgraph s of the query graph q is considered only if Dist(s, q) ≤ Θ, and SAPPER finds all occurrences of the query subgraphs whose distance from q is no more than the given threshold Θ. Even though this set of subgraphs can be very large and highly overlapping, the subgraph enumeration order is used to prune some of these subgraphs before they are matched against the database graph. The lexicographical order is used to enumerate every subgraph s in a Depth-First enumeration order. A subgraph that has at least one match in the database is given an order; subgraphs whose prefix has already been enumerated are not enumerated again [3]. After the enumeration process, SAPPER matches the query subgraphs against the database graph.

The rest of this paper is organized as follows: Section 2 covers related work in subgraph matching that is similar to SAPPER. Section 3 gives definitions and a simple dataset example that we use to clarify how SAPPER works. Section 4 describes how to preprocess the database graph, construct the index, and process the queries. Section 5 includes the datasets and the experimental results. Section 6 is the conclusion and the future work.

2. RELATED WORK

Querying a graph in a database graph is the process of finding all, or a subset, of the matches of a given query graph. Subgraph matching approaches fall into two categories based on index usage. The first category focuses on subgraph isomorphism; methods of this kind, such as subgraph isomorphism algorithms, do not use an index for the database graph. The second category is index-based graph matching, which uses an index to make the database graph easier to maintain and search. This category can be divided into graph indexing and subgraph indexing. The main goal of graph indexing is to find the exact matches of the query graph, while subgraph indexing aims to find all, or a subset, of the approximate matches of a given query graph. Since SAPPER is a subgraph indexing method, our focus is on the subgraph indexing approach.

SAPPER is not the only index-based subgraph matching method; related work in this area includes GADDI and TALE. In [1], Tian & Patel propose TALE (a Tool for Approximate Large Graph Matching), an approximate subgraph matching method that uses a hybrid index structure called the Neighborhood Index (NH-Index) to maintain information about every node in the database graph. This method ranks the nodes of the query graph by importance using a chosen measure, starts the matching process with the important nodes, and then extends those matches to cover the whole query graph. GADDI [2] is another index-based approximate subgraph matching method; it uses a distance-based index and enumerates all possible approximately isomorphic graphs of the query graph in large data.

3. DEFINITIONS AND DATASET

This section includes some definitions and the basic dataset example that is used in the rest of the paper.

Definition 1: AI(q, Θ) is the set of query subgraphs that contains every subgraph s of the query graph q with Dist(s, q) ≤ Θ.

Definition 2: G = (V, E) is the database graph, with vertex set V and edge set E. It stores the data as an undirected, weighted, labeled graph. Every vertex has a unique ID and a label.

Definition 3: q = (Vq, Eq) is a query graph with vertex set Vq and edge set Eq.

Definition 4: N1(v, G) is the set of vertices that are directly connected to vertex v. In other words, it is the set of first neighbors of v.

Definition 5: N2(v, G) is the set of vertices that are connected to v by a path of two edges, that is, the neighbors of v's neighbors.

Figure 1: The Database Graph G and The Query Graph q

4. SAPPER

SAPPER's process is divided into two main steps:

Step 1: The hybrid neighborhood unit (HNU) index stores the local information about each vertex in the database graph, with the help of a bloom filter to speed up query processing.

Step 2: The query processing identifies the approximate matches of the query graph q in the database graph G. This step has four sub-steps:

• Vertex matching, to find all candidate matches in G for each vertex vq ∈ q.
• Enumerating the query graph's edges lexicographically, from the smallest to the largest, based on their occurrences in the database graph G.
• Generating all possible subgraphs of the query graph that are at most θ edges away from q, then enumerating these subgraphs in Depth-First enumeration order.
• For each subgraph, generating its random spanning trees, then matching every subgraph with the database graph G.

The whole SAPPER procedure is given in algorithm (8). In this section, we study and analyze each process in detail.

4.1 Hybrid Neighborhood Unit Index

The hybrid neighborhood unit (HNU) index stores each vertex's degree, label, and neighbors. It is like a table of vertex information with the associated neighbor information. In our implementation of the HNU index, we collected the local information for every vertex in the database graph G: the vertex's degree, its label, its first neighbors N1(v, G), and its second neighbors N2(v, G). The set N1(v, G) contains the direct neighbors of a vertex v. How to compute the second neighbors, however, is not clear enough in the paper [3]: it does not say whether the first neighbors have to be included. In our implementation, we retrieved the direct neighbors of a vertex v with Graphs.neighborListOf(GraphDatabase, v), a method of the JGraphT library. For the second neighbors we tried two cases: in the first case we considered only the vertices that are exactly two edges away from v, and in the second case we included all the vertices that are one or two edges away from v. In our tests, the second case gave more accurate results than the first, so we used it in the rest of the implementation.

As indicated in the paper, the approximate number of vertices in the set of second neighbors N2(v, G) could be two times the average degree of the database graph G, which is calculated as follows:

Average Degree of G = 2 * |E| / |V|

This average increases with the size of the database graph, and the set of second neighbors N2(v, G) grows with it. For this reason, SAPPER stores N2(v, G) in a bloom filter to make it easier to find the matching vertices in the database graph G.

Definition 6: Bloom Filter. The bloom filter is a data structure that stores a large set and tests whether an element is a member of it. It consists of a binary array and multiple hash functions.

In the HNU index we constructed a bloom filter for every vertex v in G to store the set N2(v, G), based on two parameters as defined in the paper. The number of bits in the bloom filter is ⌈9.6 d^2⌉, and the number of hash functions is seven, which reduces the false positive rate to 0.01. For example, to add the labels of the vertex v2 to the bloom filter, the labels first pass through all the hash functions (h0, h1, ..., h6), and each result is an index in the bloom filter. Suppose we get the following results: h0(v2) = 3, h1(v2) = 4, h2(v2) = 6, h3(v2) = 10, h4(v2) = 7, h5(v2) = 14, h6(v2) = 1. We then go to each of those indices of the bloom filter B and set it to 1, as shown in figure (2).

Figure 2: The Representation For Bloom Filter Data Structure

In our implementation, we used the open-source Bloom Filter library from Guava: docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/hash/BloomFilter.html.
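To make the bloom filter description concrete, the following is a minimal sketch of the data structure, not the Guava class used in our implementation. It assumes that d in the bit-count formula is the average degree of G, and it derives the seven hash functions by salting SHA-256 with the function index, which is a stand-in of our own choosing. The names BloomFilter, bloom_size, and might_contain are hypothetical.

```python
import hashlib
import math


class BloomFilter:
    """A binary array plus several hash functions (illustrative sketch)."""

    def __init__(self, num_bits, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indices(self, item):
        # Assumption: hash function h_i is SHA-256 salted with the index i.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        # Set the bit at every index produced by the hash functions.
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[idx] for idx in self._indices(item))


def bloom_size(avg_degree):
    # Number of bits: ceil(9.6 * d^2), with d assumed to be the average degree.
    return math.ceil(9.6 * avg_degree ** 2)


# Average degree = 2 * |E| / |V|, here with |V| = 65 and |E| = 129.
avg_degree = 2 * 129 / 65
second_neighbor_labels = BloomFilter(bloom_size(avg_degree))
for label in ["A", "B"]:
    second_neighbor_labels.add(label)
```

A membership test such as second_neighbor_labels.might_contain("A") never misses an element that was added, which is the property the vertex matching step relies on when it checks N2 labels.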

4.2 Query Processing

This step covers all the processing that SAPPER performs during subgraph matching: vertex matching to find all candidate matches in G for each vertex vq ∈ q, generation of random spanning trees of q, enumeration of all subgraphs in Depth-First enumeration order, and the final graph matching process.

4.2.1 Vertex Matching

The vertex matching step finds all matching vertices for each vertex in the query graph q. A vertex vG ∈ G is a match for a vertex vq ∈ q if the following conditions are satisfied:

• The label of vq is the same as the label of vG.
• The degree of vq is less than or equal to the degree of vG.
• The labels of the vertices in N1(vq, q), the first neighbors of vq, are a subset of the labels of the vertices in N1(vG, G).
• The labels of the vertices in N2(vq, q), the neighbors within two edges of vq, are a subset of the labels of the vertices in N2(vG, G).

We implemented the vertex matching process as in algorithm (1). The result of the vertex matching is a set of candidate vertices from the database graph for each vertex in the query graph. Figure (3) shows the results of the vertex matching for our database graph example.

Vertex Matching Results:
Vertex (1) : [1, 2, 3, 20]
Vertex (2) : [6, 7, 9, 11]
Vertex (3) : [8, 10, 12, 13]
Vertex (4) : [16, 18]
Vertex (5) : [15, 17, 19]

Figure 3: The Query Graph Vertices' Matches in G Based on The Vertex Matching Algorithm

Although we found many candidate matches for each vertex in the query graph, these matches are based only on the local information given by the HNU index, and some of them could be false positives. To reduce the number of matches, we have to consider the global information by generating random spanning trees of q using the random walk that is described in detail in the next section.

4.2.2 Random Spanning Tree Generation and Matching

Since a spanning tree is a special form of a graph, in this step of SAPPER we generated different random spanning trees of q to prune some of the previous matches. Using the spanning trees to query the database graph G also helps to reduce time consumption.

Definition 7: Spanning Tree. Let q be an undirected connected graph with V vertices and E edges. A spanning tree of q is a subgraph that includes all the vertices of q and (V − 1) edges.

Definition 8: Random Walk. A random walk in an undirected graph q = (V, E) is a process that starts from a randomly selected vertex, the prime vertex, and then randomly selects one of its neighbors. This process repeats from the selected vertex until every vertex in q has been visited.

Definition 9: Markov Chain. A Markov chain is a discrete-time stochastic process that represents a sequence of random variables, or moves, between different states S. In our case S = V1, V2, ..., Vn is the set of vertices in the query graph q. In this chain the probability of transitioning from one state to another (from vertex v to vertex w) is given by a matrix called the probability transition matrix P. This matrix is defined as follows: P(v, w) = 1/d(v) if there is an edge between v and w, where d(v) is the degree of v, and P(v, w) = 0 if there is no edge between v and w.
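As a small worked illustration of Definition 9, the sketch below builds the transition matrix P for a three-vertex path. The graph and the function name transition_matrix are hypothetical and are not taken from our dataset.

```python
def transition_matrix(vertices, edges):
    """Return P as nested dicts with P[v][w] = 1/d(v) for each edge (v, w)."""
    adj = {v: set() for v in vertices}
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    # Entries for non-adjacent pairs are 0; the division is only evaluated
    # when w is a neighbor of v, so d(v) is never zero there.
    return {
        v: {w: (1 / len(adj[v]) if w in adj[v] else 0.0) for w in vertices}
        for v in vertices
    }


# Hypothetical path graph v1 - v2 - v3.
P = transition_matrix(["v1", "v2", "v3"], [("v1", "v2"), ("v2", "v3")])
```

Each row of P sums to 1. In this example P["v1"]["v2"] is 1 while P["v2"]["v1"] is 0.5, so the two directions of an edge carry the same probability only when both endpoints have the same degree.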

How to generate a random spanning tree: To generate a random spanning tree, we used a random walk on q that simulates a discrete-time Markov chain with a finite sequence of moves between the vertices of q, based on the probabilities defined by the transition matrix P. P(v, w) = 1/d(v) is the probability of moving from one vertex v to another vertex w. Every time we reach a new state (vertex), the next move depends only on the current state, not on the previous ones. This is how the Markov chain works with the spanning tree. In our case, the set of vertices of the graph is the set of states S. Because the graph is undirected, the walk can move from w back to v whenever it can move from v to w; the two probabilities, 1/d(v) and 1/d(w), are equal only when the two vertices have the same degree.

Due to the lack of details in the paper, we could not determine the exact technique used to generate the random spanning trees so as to obtain |V(q)| + 1 unique trees. Furthermore, the paper gives no details about how the number of spanning trees is calculated, or whether a reversed tree counts as a different tree. We found that the number of spanning trees of a graph depends on the structure of the query graph. A complete graph Kn is a graph in which every pair of vertices is joined by an edge, while the cycle graph Cn is a graph that consists of a single cycle through all the vertices. The number of spanning trees of the cycle graph Cn is n, where n is the number of vertices, while for the complete graph Kn it is n^(n−2). In our implementation, we generated the random spanning trees using algorithm (2) and fixed the number of spanning trees of a graph at |V(q)| + 1. We also assumed that the paper considers only distinct spanning trees of the query graph q and does not count a reversed tree separately. The figure below shows some of the random spanning trees generated for a subgraph of the query graph using algorithm (2).

Figure 4: Random Spanning Trees of The Subgraph e1 e2 e3 e4 e5.

After we constructed all the spanning trees of a query subgraph, we matched these trees with the graph database G using the tree matching algorithm (3), which is described in the following section.
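The random-walk construction described above can be sketched as follows. This is not algorithm (2) itself, whose details we do not reproduce here; it is a minimal sketch that keeps the edge used the first time the walk enters each vertex, assumes the graph is connected, and treats a tree and its reversal as the same tree by storing undirected edges. The example graph (a five-vertex cycle with one chord) and all function names are hypothetical.

```python
import random


def random_walk_spanning_tree(adj, rng):
    """Walk randomly until every vertex is visited; keep first-entry edges."""
    current = rng.choice(sorted(adj))            # random prime (start) vertex
    visited = {current}
    tree = set()
    while len(visited) < len(adj):               # assumes a connected graph
        nxt = rng.choice(sorted(adj[current]))   # each neighbor: prob. 1/d(v)
        if nxt not in visited:
            visited.add(nxt)
            tree.add(frozenset((current, nxt)))  # undirected edge
        current = nxt
    return frozenset(tree)


def distinct_random_trees(adj, how_many, rng, max_tries=1000):
    """Collect up to how_many distinct trees, giving up after max_tries walks."""
    trees = set()
    for _ in range(max_tries):
        if len(trees) == how_many:
            break
        trees.add(random_walk_spanning_tree(adj, rng))
    return trees


# Hypothetical query graph: 5 vertices, 6 edges.
EDGES = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 3)]
ADJ = {}
for u, w in EDGES:
    ADJ.setdefault(u, set()).add(w)
    ADJ.setdefault(w, set()).add(u)

# |V(q)| + 1 distinct trees, as fixed in our implementation.
trees = distinct_random_trees(ADJ, how_many=len(ADJ) + 1, rng=random.Random(0))
```

The max_tries guard matters because a sparse query graph may have fewer than |V(q)| + 1 distinct spanning trees, in which case the loop would otherwise never finish.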

Tree Matching Process: Since a tree is a simpler form of a graph, matching a tree with the graph database is simpler and more efficient than matching the full graph. This approach is based on the subgraph matching property:

Definition 10: Subgraph Matching Property. Given a query graph q and a database graph G, if q' is a subgraph of q, then any match q'' of q in the database G must contain a match of q'.

The tree matching process has high pruning power: it removes some of the vertex matches that we obtained from the vertex matching process. The process starts by selecting a random vertex from the query graph q to be the prime vertex in all the spanning trees. Then, starting from the prime vertex, it matches every vertex in the tree with the graph database using the candidate set of each vertex. The paper [3] does not give details about the tree matching algorithm, and in our implementation we used algorithm (3). In this algorithm, for every vertex we tried to match the neighbors of each of its matches with the matches of the next vertex, in order to locate the vertices that are near each other in the graph database. This process is performed in a Depth-First manner until the whole tree is matched.

Definition 11: Depth First Search (DFS). Depth-first search (DFS) is an algorithm used to traverse a graph. It starts from the prime vertex, which is treated as the root, and goes as deep as possible until it reaches a leaf, so that every vertex in the tree is visited, as described in the pseudo-code below:

DFS(Prime Vertex v) {
    visit-Vertex(v);
    for each neighbor u of vertex v {
        if u is not visited {
            DFS(u);
            add edge vu to the tree Ti
        }
    }
}
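A runnable version of the DFS pseudo-code is sketched below. It records the tree edge before recursing, which yields the same set of edges as the pseudo-code, and it visits neighbors in sorted order only to make the output reproducible. The example graph and the name dfs_tree are hypothetical.

```python
def dfs_tree(adj, prime):
    """Return the DFS tree edges of a graph, starting from the prime vertex."""
    visited = set()
    tree_edges = []

    def visit(v):
        visited.add(v)
        for u in sorted(adj[v]):          # sorted only for reproducibility
            if u not in visited:
                tree_edges.append((v, u))  # edge vu joins the tree Ti
                visit(u)

    visit(prime)
    return tree_edges


# Hypothetical graph: a five-vertex cycle with the chord (1, 3).
adj = {1: {2, 3, 5}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {1, 4}}
tree = dfs_tree(adj, prime=1)
```

For this graph the traversal from vertex 1 goes 1, 2, 3, 4, 5 without backtracking, so the tree is the path (1, 2), (2, 3), (3, 4), (4, 5).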

After applying this algorithm to our database example, we expected an accurate set of matches M(v, Ti) for each vertex in the tree, but the matching results were not satisfactory because of the lack of information in the paper. Up to this point, we faced two unclear parts. The first is how to compute the set N2(v, G). As discussed previously, there are two possible ways to compute the neighbors in N2: considering only the vertices that are exactly two edges away from v, or considering all the vertices around v, which includes both the first and the second neighbors. The second is in the vertex matching process. There are four conditions in the vertex matching process, and it is not stated clearly whether conditions 3 and 4 have to be satisfied at the same time or whether one of them is enough for two vertices to match. These two unclear points also affected the results of the tree matching phase. From them we derived four cases:

A. A vertex belongs to the set N2(v, G) only if it is exactly two edges away from v.
B. A vertex belongs to the set N2(v, G) if it is two edges or fewer away from v, which means N2(v, G) includes N1(v, G) as well.
C. A vertex vG is a match for a vertex vq if they satisfy condition 3 or condition 4.
D. A vertex vG is a match for a vertex vq if they satisfy conditions 3 and 4 at the same time.

We tried all possible combinations of these four cases in our implementation and compared the results to select the most reasonable one:

1. A and C
2. A and D
3. B and C
4. B and D

After applying all these combinations, we found that combination 3 (B and C) gave more accurate matches in both the vertex matching and the tree matching results. For this reason, we used the results of this combination in the rest of our implementation.
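The four vertex matching conditions and the cases A to D can be sketched as below. This is an illustration under our own assumptions rather than algorithm (1): label sets are compared as plain sets (not multisets), the flag include_first selects case B over case A, and the flag require_both selects case D over case C. The tiny graphs and all names are hypothetical.

```python
def second_neighbors(adj, v, include_first):
    """N2(v): case B (include_first=True) or case A (include_first=False)."""
    first = set(adj[v])
    reach = set()
    for u in first:
        reach |= adj[u]
    reach.discard(v)
    return (reach | first) if include_first else (reach - first)


def is_vertex_match(vq, vg, q_adj, q_label, g_adj, g_label,
                    include_first=True, require_both=False):
    if q_label[vq] != g_label[vg]:                       # condition 1
        return False
    if len(q_adj[vq]) > len(g_adj[vg]):                  # condition 2
        return False
    cond3 = ({q_label[x] for x in q_adj[vq]}
             <= {g_label[x] for x in g_adj[vg]})         # condition 3
    cond4 = ({q_label[x] for x in second_neighbors(q_adj, vq, include_first)}
             <= {g_label[x] for x in second_neighbors(g_adj, vg, include_first)})
    return (cond3 and cond4) if require_both else (cond3 or cond4)


def candidates(q_adj, q_label, g_adj, g_label, **options):
    return {vq: [vg for vg in sorted(g_adj)
                 if is_vertex_match(vq, vg, q_adj, q_label, g_adj, g_label,
                                    **options)]
            for vq in sorted(q_adj)}


# Hypothetical query: q1(A) - q2(B). Hypothetical database: g1(A)-g2(B)-g3(C)-g4(A).
q_adj = {"q1": {"q2"}, "q2": {"q1"}}
q_label = {"q1": "A", "q2": "B"}
g_adj = {"g1": {"g2"}, "g2": {"g1", "g3"}, "g3": {"g2", "g4"}, "g4": {"g3"}}
g_label = {"g1": "A", "g2": "B", "g3": "C", "g4": "A"}

b_and_c = candidates(q_adj, q_label, g_adj, g_label)                     # combination 3
b_and_d = candidates(q_adj, q_label, g_adj, g_label, require_both=True)  # combination 4
```

In this example combination 3 keeps g4 as a candidate for q1 because condition 4 alone holds, while combination 4 rejects it. Candidates of this kind are the false positives that the tree matching step is meant to prune.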

4.2.3 Query Graph Enumeration Order

In this process, we first enumerated the edges of the query graph lexicographically based on their occurrences in the database graph, from the smallest to the largest. Then, we generated all possible subgraphs of the query graph and enumerated them in Depth-First enumeration order for the final matching process.

Query Graph's Edges Enumeration Process: In our implementation, we assigned a lexicographical order to each edge in the query graph q based on the number of matches of that edge in the database graph G. The order is assigned from the smallest to the largest number of matches. For example, if there are two edges e1 and e2, and e1 has fewer matches than e2, then e1 receives a smaller order than e2 (e1 < e2). If two edges have the same number of matches, their relative order is assigned arbitrarily. We used DFS to traverse the query graph q and count the number of matches for each edge. For example, we started with vertex v1 and retrieved its match set, say M(v1) = x1, x2, x3. Then, for each vertex connected to v1 in the query graph q, for example v2 with match set M(v2) = y1, y2, we compared the neighbors of x1, x2, and x3 with the matches of v2. In this way we check whether the edge between v1 and v2 actually exists in the graph database G. The number of matches between these vertices is the number of occurrences of that edge in the graph database. This process is explained in Algorithm (4). Result: The lexicographical order of the query graph's edges.

Figure 5: The Query Graph’s Edges Enumeration Order.
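The edge ordering step can be illustrated with the following sketch, which is not Algorithm (4) itself. It counts, for every query edge, the candidate pairs that are joined by an edge in G, and then numbers the edges from the fewest to the most occurrences. The candidate sets, the database edges, and the function names are hypothetical; ties keep their input order, which is one arbitrary choice among many.

```python
def count_edge_matches(query_edges, matches, g_adj):
    """Occurrences of each query edge (u, v): pairs x in M(u), y in M(v) adjacent in G."""
    return {
        (u, v): sum(1 for x in matches[u] for y in matches[v] if y in g_adj[x])
        for u, v in query_edges
    }


def order_edges(counts):
    """Assign ids 1..|E(q)| from the smallest to the largest match count."""
    ranked = sorted(counts, key=lambda edge: counts[edge])  # stable for ties
    return {edge: i for i, edge in enumerate(ranked, start=1)}


# Hypothetical candidate sets M(v) and database adjacency.
matches = {"a": [1, 2], "b": [3, 4], "c": [5]}
g_adj = {1: {3}, 2: {3, 4}, 3: {1, 2, 5}, 4: {2}, 5: {3}}
query_edges = [("a", "b"), ("b", "c")]

counts = count_edge_matches(query_edges, matches, g_adj)
edge_ids = order_edges(counts)
```

Here the edge (b, c) occurs once and the edge (a, b) occurs three times, so (b, c) receives id 1 and (a, b) receives id 2.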

In our implementation, we gave every edge an id that represents its order. These ids run from 1 to |E(q)|, the number of edges in the query graph q, as shown in the table in figure (5). After we determined the order of the edges in the query graph q, we generated all subgraphs of the query graph that contain a different set of edges and enumerated them lexicographically, smallest to largest, using the Depth-First enumeration, before matching these subgraphs with the graph database. To get all possible subgraphs ("sequences"), we treated the set of edges in the query graph as a sequence of edges (e1 e2 e3 e4 e5 e6), then generated all possible subsequences of that sequence. These subgraphs form the set AI(q, Θ).

Subgraphs Enumeration Order: Since SAPPER's goal is to find all approximate matches of the query graph q, it matches all the subgraphs in the set AI(q, Θ), where each subgraph matches one or more subgraphs in the database graph G. The subgraphs in this set are numerous and highly overlapping. Their number is approximately O(m^Θ), where m is the number of edges in q. In our case, if Θ = 2 then m^Θ = 36 subgraphs; this bound covers the subgraphs with one missing edge (m^1 = 6) and the subgraph with no missing edges. SAPPER aims to prune some of these subgraphs as early as possible, before matching against the database graph G, by enumerating them in Depth-First enumeration order.
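As a rough check of these numbers, the sketch below generates every edge subsequence that misses at most Θ edges and sorts the result lexicographically. It treats edges as the ids 1 to m and does not test connectivity, so it counts candidate sequences rather than valid subgraphs. The function name candidate_subgraphs is hypothetical.

```python
from itertools import combinations


def candidate_subgraphs(num_edges, theta):
    """All edge-id subsequences with at most theta missing edges, in lexicographic order."""
    edge_ids = range(1, num_edges + 1)
    sequences = []
    for missing in range(theta + 1):
        sequences.extend(combinations(edge_ids, num_edges - missing))
    return sorted(sequences)


sequences = candidate_subgraphs(num_edges=6, theta=2)
```

For m = 6 and Θ = 2 this produces 1 + 6 + 15 = 22 sequences, which stays within the m^Θ = 36 estimate. The sorted list starts with (1, 2, 3, 4) and ends with (3, 4, 5, 6).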

Depth-First Enumeration: SAPPER uses a Depth-First enumeration process to enumerate all subgraphs in AI. This process starts from the lexicographically smallest sequence of edges that could be a subgraph of the query graph within Θ edge distance from q. In our example, with Θ = 2, we started with q1 = e1 e2 ... el (l = |E(q)| − Θ). If we found at least one match of this sequence (subgraph) in the graph database, we enumerated this subgraph and continued by adding the edge with the smallest lexicographical order after el to form a new sequence that represents a new subgraph, say q2 = (e1 e2 ... el e(l+1)). Since q1 is a prefix of q2, the set of matches of q1 is a superset of the matches of q2, which reduces the scope of the search for q2's matches in the graph database. Adding an edge to the current subgraph is called LEXI Next; it continues until no more edges can be added to the subgraph (sequence) or no more matches can be found in the database. Since the graphs that contain s as a prefix overlap with the graph s, we enumerated graph s only if we found at least one match for it in the graph database. If we did not find any match for the current sequence (subgraph), we skipped all the subgraphs that contain this sequence as a prefix and started with a new sequence that does not have s as a prefix, for example e1, e3, ..., el. This procedure is called LEXI Jump; it continues until we reach the last sequence that could be a subgraph, e(l-2) e(l-1) el. Using the Depth-First enumeration order we can reach every subgraph in AI(q, Θ) at least once. In the paper [3], the authors use LEXI Next and LEXI Jump to enumerate the subgraphs in Depth-First enumeration order. After analyzing these processes, we found that they essentially sort the subgraphs (sequences) by the lexicographical order of their edges, as shown in figure (6).

Figure 6: LEXI Next and LEXI Jump Process

In our implementation, we treated the set of edges as a sequence of characters in a string in order to get all possible combinations of these edges, taking into consideration their lexicographical order. We created all possible subgraphs first, then matched them in order. In our query graph example, the enumeration process started with the smallest sequence ("subgraph"), which contains the edges with the smallest lexicographical order (q1 = e1 e2 e3 e4) where Θ = 2, and ended with the largest subgraph, which contains the edges with the largest lexicographical order (q = e3 e4 e5 e6). Figure (7) shows all candidate subgraphs of the query graph with at most Θ missing edges, together with their enumeration order. The following section explains how SAPPER uses this order.

Figure 7: The Subgraphs Queries' Structure in Depth-First Enumeration Order
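The interplay of LEXI Next and LEXI Jump over the sorted sequences can be sketched as follows. This is our reading of the procedure, not code from the paper [3]: walking the lexicographically sorted list visits each extension right after its prefix (LEXI Next), and a sequence with no match causes every later sequence that has it as a prefix to be skipped (LEXI Jump). The predicate has_match stands in for the real matching against G, and the example predicate (edge e5 never occurs) is hypothetical.

```python
from itertools import combinations


def enumerate_with_pruning(sequences, has_match):
    """Visit sorted sequences; skip extensions of any sequence without a match."""
    enumerated = []
    tested = []
    failed_prefix = None
    for seq in sequences:
        if failed_prefix is not None and seq[:len(failed_prefix)] == failed_prefix:
            continue                      # LEXI Jump: prefix already failed
        failed_prefix = None
        tested.append(seq)
        if has_match(seq):
            enumerated.append(seq)        # LEXI Next reaches its extensions next
        else:
            failed_prefix = seq
    return enumerated, tested


# All sequences over edge ids 1..6 with at most 2 missing edges, sorted.
sequences = sorted(
    combo for size in (4, 5, 6) for combo in combinations(range(1, 7), size)
)

# Hypothetical matcher: pretend edge e5 never occurs in the database graph.
enumerated, tested = enumerate_with_pruning(sequences, lambda seq: 5 not in seq)
```

With this stand-in matcher, only the sequences without e5 are enumerated, and fewer sequences are tested than exist, because extensions such as e1 e2 e3 e4 e5 e6 are skipped once e1 e2 e3 e4 e5 fails.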

4.2.4 Graph Matching

In this phase, we processed all the query subgraphs in top-down, left-to-right order. For every subgraph we have to check several cases before matching the subgraph with the database graph.

First: Subgraph Connectivity. We check whether the subgraph is connected by using DFS to traverse it. If the number of vertices visited by the DFS is equal to the number of vertices in the tested subgraph, then the subgraph is connected; otherwise it is not. We considered only the connected subgraphs.

Second: Spanning Tree. We check whether the subgraph can have a spanning tree. There are two scenarios to consider before searching for matches.

In the first scenario, we have not searched any prefix of the query subgraph q1. In this case, we compare the number of edges with the number of vertices in the subgraph. If the number of edges is greater than or equal to the number of vertices, then the subgraph can have more than one spanning tree. We then generated some random spanning trees for that subgraph using the Random Spanning Tree Algorithm (2) and searched for matches of each spanning tree using the Tree Matching Algorithm (3). For example, suppose the subgraph has spanning trees T1, T2, ..., Tr, and M(v, Ti) for 1 ≤ i ≤ r are the matches of the prime vertex v in each spanning tree. The intersection of the prime vertex's matches over all the spanning trees is then used as the starting point for finding the exact matches of the query subgraph q. Otherwise, if the number of edges is |V(q)| − 1, the subgraph is itself a tree, and we searched for its matches directly using the Tree Matching Algorithm (3).

In the second scenario, we already have some matches of a prefix of the subgraph. In this case, we extended the match set of the prefix by searching only for the edges that belong to the current subgraph and not to the prefix. For example, if the first subgraph is q1 = e1 e2 e3 e4 and the next is q2 = e1 e2 e3 e4 e5, then q1 is a prefix of q2. If we found some matches of q1 in the database graph, we extended them by searching only for the edge that belongs to q2 and not to q1, which is e5.
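The two checks made before matching a subgraph can be sketched as below, using the fact that a connected graph with |V| − 1 edges is a tree. The function names, the result strings, and the example edge lists are hypothetical.

```python
def build_adjacency(edges):
    adj = {}
    for u, w in edges:
        adj.setdefault(u, set()).add(w)
        adj.setdefault(w, set()).add(u)
    return adj


def is_connected(edges):
    """DFS from any vertex; connected if every vertex of the subgraph is reached."""
    adj = build_adjacency(edges)
    if not adj:
        return False
    start = next(iter(adj))
    visited = {start}
    stack = [start]
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u not in visited:
                visited.add(u)
                stack.append(u)
    return len(visited) == len(adj)


def classify_subgraph(edges):
    """Decide how a subgraph with no searched prefix should be handled."""
    if not is_connected(edges):
        return "not connected"            # skipped
    num_vertices = len(build_adjacency(edges))
    if len(edges) == num_vertices - 1:
        return "tree"                     # match directly with tree matching
    return "needs spanning trees"         # generate random spanning trees first
```

For instance, the edge list [(1, 2), (2, 3)] is classified as a tree, the triangle [(1, 2), (2, 3), (3, 1)] needs spanning trees, and [(1, 2), (3, 4)] is not connected.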

Example: How this process works. SAPPER processes these subgraphs top-to-bottom, then left-to-right. As shown in figure (8), the first subgraph is q1 = e1 e2 e3 e4. This subgraph is connected and can have a tree, so we used the tree matching algorithm (Algorithm 3) to find the matches of this tree in the database graph. The results of the tree matching algorithm are:

Vertex (1) : [1, 2, 3, 20]
Vertex (2) : [6, 7, 9, 11]
Vertex (3) : [8, 10, 12, 13]
Vertex (4) : [16, 18]
Vertex (5) : [15, 17, 19]

These results represent the set of candidate vertices for each vertex in the query subgraph q1. Then, for each match of q1 in G, we mapped the matched vertices to the vertices in q1. In this process, we mapped the query graph edges using the previous results to find the candidate edges that match the edges in the query graph. The results of the mapping process are shown in figure (9).

Figure 8: The Subgraphs Queries' Structure in Depth-First Enumeration Order

Figure 9: e1 e2 e3 e4 Matches in G

The results of the edge mapping are as follows:

Match 1: [[9, 16], [16, 17], [10, 17], [3, 10]]
Match 2: [[11, 18], [18, 19], [13, 19], [20, 13]]

After identifying the matches of the subgraph q1, we added one edge (e5) to q1 to get a new subgraph, q2 = e1 e2 e3 e4 e5. Again, we checked the connectivity of this subgraph and whether it can have spanning trees. Here q2 is connected and has a previously discovered subgraph, q1, as a prefix. We therefore used the matches of q1 to search only for the undiscovered edges that exist in q2, using algorithm (6). The results of this process are shown in figure (10).

Figure 10: e1 e2 e3 e4 e5 Matches Based on The Prefix's Matches

After we determined the matches of q2, we added one edge (e6) to the current subgraph q2 to get the new subgraph q3 = e1 e2 e3 e4 e5 e6. This subgraph follows the same process as the previous one: we only need to search for e6 in the matches of q2. The results are shown in figure (11).

Figure 11: e1 e2 e3 e4 e5 e6 Matches Based on The Prefix's Matches

Once we found the matches of q3, no more edges could be added to the current subgraph. In this case we jumped to the next subgraph that is not a supergraph of q3, for example q4 = e1 e2 e3 e4 e6. This subgraph can have more than one spanning tree, so we have to find all the spanning trees first and then match the trees with the graph database G. We have to check every subgraph in the Depth-First enumeration order because in some cases a non-connected subgraph can have a connected supergraph; an example is shown in figure (12).

Figure 12: Connected Subgraph of Non-Connected Subgraph

Map The Matches With The Database Graph: In this process, we considered the two previous scenarios. If the subgraph has no prefix, the mapping process consists of two steps. In the first step, we map the subgraph query onto the database graph to identify the candidate edges for each edge in the subgraph query. In the second step, we combine these edge matches to find complete subgraph matches in the database graph.

For the first step, we implemented a method called Mapping-To-Find-Exact-Edge's-Match, which maps the edges in the subgraph query to the edges in the database graph to identify the candidate edges. The whole process is shown in Algorithm 5. For each edge in the subgraph query, we compared the direct neighbors of the edge source's matches with the matches of the edge target. Every time these two sets of vertices share a vertex, there is an edge in G between a match of the edge source and a match of the edge target. We then compared the label of this edge in G with the label of the edge in the subgraph query. If both conditions are satisfied, the edge in G is a match for the edge in the query graph. We continued matching until we had found the matches for every edge in the subgraph query.

For the second step, we implemented a method called Perform-The-Final-Matches. This method builds a Cartesian product of the edge matches from the previous step to get all possible paths, and it is based on Algorithm 7.

For the second scenario, if the subgraph has a prefix, we implemented a method called Search-For-Undiscovered-Edges, which uses Algorithm 6 to search for undiscovered edges in the prefix's set of matches. In this method, we compared the vertex matches of both the source and the target of the undiscovered edge with the prefix's matches, taking the edge label into consideration. If both the source and the target are found in a previous match, there is a match for that edge. We continued searching for matches of the undiscovered edges in all previous matches.
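The second mapping step can be illustrated with a small Python sketch. It is not the project's implementation: the function name `join_edge_matches` and the data layout (a dict from each query edge to its list of candidate database edges) are assumptions made for illustration. The sketch takes the Cartesian product of the per-edge candidates and keeps only the combinations that assign each query vertex to a single database vertex and map distinct query vertices to distinct database vertices.

```python
from itertools import product


def join_edge_matches(edge_matches):
    """Combine per-edge candidates into full subgraph matches.

    edge_matches maps a query edge (u, v) to a list of candidate
    database edges (x, y), read as u -> x and v -> y.
    Returns a list of dicts mapping query vertices to database vertices.
    """
    query_edges = list(edge_matches)
    results = []
    for combo in product(*(edge_matches[e] for e in query_edges)):
        assignment = {}
        consistent = True
        for (u, v), (x, y) in zip(query_edges, combo):
            for q_vertex, g_vertex in ((u, x), (v, y)):
                # setdefault returns the earlier mapping if one exists.
                if assignment.setdefault(q_vertex, g_vertex) != g_vertex:
                    consistent = False
                    break
            if not consistent:
                break
        # Distinct query vertices must map to distinct database vertices.
        if consistent and len(set(assignment.values())) == len(assignment):
            results.append(assignment)
    return results


# Hypothetical candidates for a two-edge path query a - b - c.
example = {
    ("a", "b"): [(1, 2), (4, 5)],
    ("b", "c"): [(2, 3), (5, 1), (2, 1)],
}
matches = join_edge_matches(example)
print(matches)
```

Of the six combinations in this made-up example, only two survive: the others either map b to two different database vertices or reuse a database vertex for two query vertices. A full product grows quickly with the number of edges, which is one reason to process the edges with the fewest candidates first.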

5.

DATASETS

We used two datasets with different graph structures to analyze the performance of SAPPER: (1) our own generated dataset, and (2) a biological AIDS dataset.

5.1

Dataset 1:

This dataset consists of 65 vertices and 129 edges with three distinct labels, and the query graph has 6 edges and 5 vertices with three distinct labels.

Figure 13: Number of Approximate Matches with Different Thresholds.

5.2

Dataset 2:

This dataset is a biological AIDS dataset that consists of 500 vertices and 500 edges across 20 database graphs, and 26,360 vertices and 24,000 edges across 4000 query graphs. We used this dataset to identify some cases that affect the performance of SAPPER. In our experiments, we ran each of the 4000 query graphs against each of the 20 database graphs, for a total of 80,000 queries, and compared the exact matches that SAPPER can retrieve with the exact matches of GraphQL. We also used this dataset to analyze the relation between the query time and several parameters.

6.

EXPERIMENTAL RESULTS

In this section, we analyze the performance of SAPPER using query graphs with different numbers of vertices and different thresholds. We first queried dataset 1 with θ = 0 and did not get any exact match of the query graph in the first run. However, by querying the database several times, we could retrieve some matches. In this dataset the query graph is a connected graph with five vertices and six edges. In this case SAPPER generates random spanning trees that are subgraphs of the query graph, uses the tree matching algorithm to match every tree against the database graph G, and then combines the results to get an exact match of the query graph. The probability of obtaining the exact match therefore depends on the probability that the random spanning trees together include all the edges of the query graph, which is why some matches may only be found after several runs. However, when θ > 0, SAPPER was able to retrieve all the approximate matches of the query graph that have θ or fewer missing edges. Since the query graph has six edges and a connected subgraph on its five vertices needs at least four edges, the maximum number of missing edges is 2. In our experiment we queried the database graph with θ = 0, which retrieves the exact matches of the graph; θ = 1, which retrieves the approximate matches with at most one missing edge; and θ = 2, which retrieves the approximate matches with two or fewer missing edges. The number of matches for each of these thresholds is shown in the table in Figure 13. Figure 14 shows the approximate matches, where every set of colored vertices is an approximate match of the query graph.
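To make the coverage argument concrete, the sketch below generates random spanning trees by a random walk, in the spirit of Algorithm 2, and checks how many query edges the union of a few trees contains. It uses a uniform random walk rather than an explicit transition matrix, and the six-edge query graph on five vertices is a made-up example, not the one used in our experiment. The function names are likewise assumptions.

```python
import random


def random_spanning_tree(vertices, edges, rng):
    """Random-walk spanning tree of a connected undirected graph."""
    neighbors = {v: [] for v in vertices}
    for u, w in edges:
        neighbors[u].append(w)
        neighbors[w].append(u)
    current = rng.choice(sorted(vertices))
    visited = {current}
    tree = set()
    while len(visited) < len(vertices):
        nxt = rng.choice(neighbors[current])
        if nxt not in visited:
            # First visit: keep the edge that reached the new vertex.
            visited.add(nxt)
            tree.add(frozenset((current, nxt)))
        current = nxt
    return tree


def max_missing_edges(num_vertices, num_edges):
    """Largest threshold that still leaves a connected subgraph."""
    return num_edges - (num_vertices - 1)


# Hypothetical query graph: 5 vertices, 6 edges (a 5-cycle plus a chord).
V = {0, 1, 2, 3, 4}
E = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]

rng = random.Random(7)
trees = [random_spanning_tree(V, E, rng) for _ in range(3)]
covered = set().union(*trees)
all_edges = {frozenset(e) for e in E}
print(len(covered), "of", len(all_edges), "query edges covered by 3 trees")
print("max missing edges:", max_missing_edges(len(V), len(E)))
```

Each tree on five vertices holds exactly four edges, so a single tree can never cover all six query edges. Whether a small set of trees covers them depends on the random draws, which is consistent with exact matches appearing only in some runs.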

Figure 14: The Approximate Matches of q with θ = 2

6.1

SAPPER and The Exact Matches:

We tested how well SAPPER retrieves the exact matches of the query graph compared to GraphQL. In our experiment, we queried SAPPER with θ = 0 and found that, for some query graphs, SAPPER may not be able to identify all the exact matches directly. After comparing the exact matches one by one, we found that when the query graph has more than one spanning tree, SAPPER was not able to retrieve all the exact matches, for the reason explained above. However, when the query graph cannot have more than one spanning tree, SAPPER was able to retrieve all exact matches of the query graph. SAPPER took four hours to find 24,640,293 exact matches in 10,000,000 queries, while GraphQL took less than five minutes to retrieve 25,111,606 exact matches. The table in Figure 15 shows the number of exact matches found by SAPPER and GraphQL for the part of dataset 2 that includes 20 database graphs and 1000 query graphs in each category.

Figure 15: Number of Exact Matches.

6.2

The Approximate Matches and The Threshold:

We studied the relation between the number of approximate matches and the threshold, and found that for θ ≥ 1 the number of approximate matches increases with the number of missing edges, as shown in the table below. We can also see in Figure 16 that for the query graphs with four vertices the number of matches is the same for all thresholds, which means there are no approximate matches for this set of query graphs.

Figure 16: The Approximate Matches and The Threshold

6.3

The Query Time:

We analyzed how the query time in SAPPER depends on different parameters, and found that it is affected by the number of vertices in the query graph q, the number of distinct labels in the database graph G, and the threshold.

6.3.1

Number of Distinct Labels in The Database:

We studied the relation between the query time and the number of distinct labels in the database graph by dividing the database graphs, based on the number of distinct vertex labels, into three sets: graphs with one label, two labels, and three or more labels. We then queried these sets several times to get the average query time for query graphs with 4, 8, and 12 vertices. The results are shown in Figure 17. Increasing the number of distinct labels in G decreases the query time, while increasing the number of vertices in the query graph increases the query time even with the same number of distinct labels. The reason is that when the database graph has more distinct labels, the number of candidate matches in the vertex matching phase decreases, which reduces the time taken to prune the vertices in the tree matching process. It also reduces the number of matching edges in the mapping process.

Figure 17: Distinct Labels in G and The Query Time

6.3.2

Number of Vertices in The Query Graph:

In SAPPER, increasing the number of missing edges increases the number of subgraphs that need to be enumerated and matched against the database graph. The table in Figure 18 shows the relation between the number of vertices in the query graph and the query time. Each row shows how the query time increases with the number of vertices in the query graph, and each column shows how the query time increases with the number of missing edges.

Figure 18: Number of Matches and The Query Time

As a conclusion of our experiments, SAPPER is an effective method to retrieve all approximate matches of the query graph with some missing edges. However, in some cases it cannot directly identify all the exact matches with θ = 0. We also found that the query time is affected by the number of distinct labels in the database graph, the number of vertices in the query graph, and the number of subgraphs that are enumerated for the given threshold.
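The effect of the threshold on the amount of enumeration can be illustrated by brute force. The sketch below counts the connected edge subsets of a query graph that omit at most θ edges. It does not follow SAPPER's lexicographical enumeration order or its reuse of prefix matches, and the example graph and function names are assumptions made for illustration.

```python
from itertools import combinations


def is_connected(vertices, edges):
    """Check that the given edges connect every vertex in vertices."""
    adjacency = {v: set() for v in vertices}
    for u, w in edges:
        adjacency[u].add(w)
        adjacency[w].add(u)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        for nxt in adjacency[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(vertices)


def count_subgraphs(vertices, edges, theta):
    """Count connected subgraphs with at most theta missing edges."""
    total = 0
    for missing in range(theta + 1):
        for removed in combinations(edges, missing):
            kept = [e for e in edges if e not in removed]
            if is_connected(vertices, kept):
                total += 1
    return total


# Hypothetical query graph: 5 vertices, 6 edges (a 5-cycle plus a chord).
V = [0, 1, 2, 3, 4]
E = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
for theta in range(3):
    print(theta, count_subgraphs(V, E, theta))
```

For this small graph the count grows from 1 subgraph at θ = 0 to 7 at θ = 1 and 18 at θ = 2, which shows why the query time rises with the threshold even before any matching is done.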

7.

CONCLUSION

We implemented SAPPER as a method to retrieve all approximate matches of the query graph with possible missing edges, using pre-generated random spanning trees and the query graph enumeration order. We also studied each process in this method in detail with some examples. Finally, in our experiments, we analyzed the performance of SAPPER using a real dataset and identified some cases where SAPPER can and cannot retrieve all the exact matches of the query graph. We also identified the parameters that affect the query time, which can guide future work on improving the performance and scalability of SAPPER for querying larger graphs.

8.

REFERENCES

[1] Y. Tian and J. M. Patel. TALE: A tool for approximate large graph matching. In Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, April 7-12, 2008, Cancún, México, pages 963–972, 2008.

[2] S. Zhang, S. Li, and J. Yang. GADDI: Distance index based subgraph matching in biological networks. In Proceedings of the 12th International Conference on Extending Database Technology (EDBT '09), 2009.

[3] S. Zhang, J. Yang, and W. Jin. SAPPER: Subgraph indexing and approximate matching in large graphs. PVLDB, 3(1):1185–1194, 2010.

APPENDIX A.

ALGORITHM DESCRIPTION

Algorithm 1: Vertex Matching

Input: Graph database G, query graph q.
Output: A set of matching vertices for every vertex in q.
begin
    for vq ∈ q do
        for vg ∈ G do
            if vq's label = vg's label then
                if vq's degree ≤ vg's degree then
                    if N1(vq, q) labels ⊆ N1(vg, G) labels then
                        vg is a match for vq; M(vq).add(vg)
                    else if N2(vq, q) labels ⊆ N2(vg, G) labels then
                        vg is a match for vq; M(vq).add(vg)
                    else
                        vg is not a match for vq

Algorithm 2: Generating Random Spanning Tree

Input: The query graph q.
Output: Random spanning tree of q.
begin
    Construct transition matrix P from q.
    Vertex set S ← ∅, edge list E ← ∅.
    Randomly select a vertex X0 of q.
    S ← S + X0
    v ← X0
    while |S| < |V(q)| do
        Randomly select vertex w by P, such that the edge e_vw exists.
        if !(w ∈ S) then
            E ← E + e_vw
            S ← S + w
        v ← w
    Output the graph composed of edge list E.

Algorithm 3: Tree Matching

Input: Set of all the spanning trees T.
Output: Set of matching vertices for each tree.
begin
    for Ti ∈ T do
        for v ∈ Ti based on DFS vertex order do
            Get the vertex matches M(v)
            Find the vertex u connected to v in the tree
            for vi ∈ M(v) do
                Get N1(vi, G)    // first neighbors
                if N1(vi, G) ∩ M(u) ≠ ∅ then
                    u ∈ M(V, Ti)
                else
                    Remove vi from M(v, Ti)

Algorithm 4: Query Graph Enumeration Order

Input: Query graph edge list = (v1 : v2), (v2 : v3), ..., (∀(vi : vj) ∈ q); DFS order of the query graph q, Q-DFS = v1, v3, ..., (∀ vi ∈ q); candidate set M(vi) (∀ vi ∈ q).
Output: Lexicographical order of each edge.
begin
    for Vq ∈ q based on DFS order do
        Find M(Vq) = v1, v2, ...
        for each vertex Uq that connects to Vq in q do
            Get Uq's candidate set M(Uq) = u1, u2, ...
            for ui ∈ M(Uq) do
                if ui is a direct neighbor of some vi ∈ M(Vq) then
                    // If there is a match, then there is an edge (vi : ui) ∈ G.
                    if (vi : ui)'s label = (Vq : Uq)'s label then
                        EdgeMatches++
                else
                    No match (vi : ui) in G.
            Edge-Occurrences-List.add((vi : ui), EdgeMatches)
    Sort the edges in Edge-Occurrences-List from smallest to largest based on the number of matches.
    Assign a lexicographical order to each edge based on its order in the list.
    Return the lexicographical order
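The filtering idea in Algorithm 1 can be sketched in Python as follows. The graph representation (a label dict plus an adjacency dict) and the function name are assumptions made for this illustration, and the sketch applies only the label, degree, and first-neighbor label tests; the second-neighbor test N2 is left out.

```python
def vertex_matches(q_labels, q_adj, g_labels, g_adj):
    """Return M: query vertex -> set of candidate database vertices."""

    def neighbor_labels(labels, adj, v):
        return {labels[n] for n in adj[v]}

    matches = {}
    for vq in q_labels:
        wanted = neighbor_labels(q_labels, q_adj, vq)
        matches[vq] = {
            vg
            for vg in g_labels
            if g_labels[vg] == q_labels[vq]          # same vertex label
            and len(q_adj[vq]) <= len(g_adj[vg])     # degree is large enough
            and wanted <= neighbor_labels(g_labels, g_adj, vg)
        }
    return matches


# Hypothetical data: query edge x(A) - y(B), database path 1 - 2 - 3 - 4 - 5.
q_labels = {"x": "A", "y": "B"}
q_adj = {"x": {"y"}, "y": {"x"}}
g_labels = {1: "A", 2: "B", 3: "A", 4: "C", 5: "A"}
g_adj = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}

M = vertex_matches(q_labels, q_adj, g_labels, g_adj)
print(M)
```

In this example vertex 5 carries the right label but has no neighbor labeled B, so it is pruned before any tree matching takes place. The candidate sets that remain are the M(v) sets consumed by Algorithms 3 and 4.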

Algorithm 5: Find Exact Edge's Matches in G

Input: Tree-Matches results, which include matches by vertex; current SubgraphQuery.
Output: Set of matches for each edge in the SubgraphQuery.
begin
    for each TreeMatches(i) ∈ TreeMatches do
        for each edge ei ∈ SubgraphQuery do
            Vs ← ei's edge source.
            Vt ← ei's edge target.
            // Compare Vs's first neighbors' matches with Tree-Matches(Vt)'s matches
            if Vs's first neighbors ∩ Tree-Matches(Vt) ≠ ∅ then
                // There is an edge between Vs and Vt in G
                if ei's label equals (vs : vt)'s label then
                    (vs's match : vt's match) is added to ei's matches.
    Return Edge's Matches set
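The first mapping step, described by Algorithm 5, can be sketched as follows. The data layout is an assumption for illustration: `vertex_matches` holds the candidates per query vertex that survive tree matching, and database edge labels are stored in a dict keyed by unordered vertex pairs. The function name is hypothetical.

```python
def edge_candidates(query_edges, vertex_matches, g_adj, g_edge_labels):
    """Map each query edge to the database edges that can match it.

    query_edges: dict (source, target) -> edge label
    vertex_matches: dict query vertex -> set of database vertices
    g_adj: dict database vertex -> set of neighbors
    g_edge_labels: dict frozenset({x, y}) -> edge label
    """
    result = {}
    for (vs, vt), label in query_edges.items():
        found = []
        for x in sorted(vertex_matches[vs]):
            # Neighbors of x that are also candidates for the target.
            for y in sorted(g_adj[x] & vertex_matches[vt]):
                if g_edge_labels[frozenset((x, y))] == label:
                    found.append((x, y))
        result[(vs, vt)] = found
    return result


# Hypothetical example: path query a - b - c with edge labels r and s.
query_edges = {("a", "b"): "r", ("b", "c"): "s"}
vertex_matches = {"a": {1, 4}, "b": {2, 5}, "c": {3}}
g_adj = {1: {2}, 2: {1, 3}, 3: {2, 5}, 4: {5}, 5: {3, 4}}
g_edge_labels = {
    frozenset((1, 2)): "r",
    frozenset((2, 3)): "s",
    frozenset((4, 5)): "t",
    frozenset((3, 5)): "s",
}

candidates = edge_candidates(query_edges, vertex_matches, g_adj, g_edge_labels)
print(candidates)
```

Here the database edge (4, 5) connects two candidate vertices but carries the wrong label, so it is rejected. Both conditions from the text, adjacency and label equality, are needed for an edge match.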

Algorithm 6: Search for Matches for Undiscovered Edges

Input: prefix's-Matches, which are matches by edge; currentSequence; Tree-Matches, which are matches by vertex; SubgraphQuery.
Output: All approximate matches of g(s)'s edges that have not been discovered.
begin
    Undiscovered Edges = CurrentSequence's edges − Prefix's edges
    for each ei ∈ UndiscoveredEdges do
        Vq1 ← ei's edge source.
        Vq2 ← ei's edge target.
        // Iterate over the tree vertex matches of the prefix
        for each match's set ∈ prefix's-Matches do
            Retrieve Vq1 and Vq2 matches.
            Find the edges connected to Vq1 and Vq2.
            // Compare Vq1's matches in Tree-Matches with the prefix's-Matches of the edges ej connected to Vq1
            if prefix's-Matches(ej) contains some of the Tree-Matches(Vq1) then
                if prefix's-Matches(el) contains some of the Tree-Matches(Vq2) then
                    if (Vq1's match : Vq2's match) is an edge ∈ G then
                        (Vq1's match : Vq2's match) ∈ ei's matches

Algorithm 7: Map The Final Matches

Input: VertexOrder, sequence, subgraphQuery, EdgeMatches.
Output: List of all approximate matches of the subgraph query.
begin
    for Vertexi ∈ VertexOrder do
        // ADJ is a list that represents vertex connectivity in q
        connectedVertices = ADJ(Vertexi)
        FirstLoop: for Vi ∈ connectedVertices do
            // Return the edge number
            EdgeOrder = GetEdgeNumber(Vertexi, Vi)
            if (EdgeOrder != Null) then
                M(EdgeOrder) = EdgeMatches.get(EdgeOrder)
                InnerLoop: for ej ∈ M(EdgeOrder) do
                    if IntialSolution.isEmpty() then
                        IntialSolution.put(EdgeOrder, ej)
                        EdgesMatched.add(EdgeOrder)
                        break InnerLoop
                    if (IntialSolution.size() ≤ 1) && (EdgesMatched.size() ≤ 2) then
                        if HasReference(IntialSolution, ej) then
                            IntialSolution.put(EdgeOrder, ej)
                            EdgesMatched.add(ej)
                            break InnerLoop
                    if (j == M(EdgeOrder).size() − 1) then
                        IntialSolution.remove(EdgeOrder, ej)
                        break FirstLoop
        if EdgesMatched.size() = 2 then
            CartesianProduct(EdgesMatched(e1)'s matches, EdgesMatched(e2)'s matches)
            // Group each two matches together if there is a reference; use CartesianProduct to map the rest of the vertices.
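The prefix-extension idea in Algorithm 6 can be sketched as a simple filter. Here a prefix match is assumed to be a dict from query vertices to database vertices, which is a simplification of the edge-based match sets used in the algorithm; the function name and the data layout are hypothetical.

```python
def extend_prefix_matches(prefix_matches, new_edge, new_label, g_edge_labels):
    """Keep the prefix matches in which the undiscovered edge also exists.

    prefix_matches: list of dicts, query vertex -> database vertex
    new_edge: (source, target) query edge that is not part of the prefix
    new_label: label of that query edge
    g_edge_labels: dict frozenset({x, y}) -> edge label in G
    """
    source, target = new_edge
    extended = []
    for match in prefix_matches:
        # Both endpoints must already be mapped by the prefix match.
        if source not in match or target not in match:
            continue
        g_edge = frozenset((match[source], match[target]))
        # The edge must exist in G and carry the same label.
        if g_edge in g_edge_labels and g_edge_labels[g_edge] == new_label:
            extended.append(match)
    return extended


# Hypothetical prefix: path a - b - c; undiscovered edge: (a, c) with label r.
prefix = [
    {"a": 1, "b": 2, "c": 3},
    {"a": 4, "b": 5, "c": 6},
    {"a": 7, "b": 8, "c": 9},
]
g_edge_labels = {frozenset((1, 3)): "r", frozenset((4, 6)): "s"}

result = extend_prefix_matches(prefix, ("a", "c"), "r", g_edge_labels)
print(result)
```

Only the first prefix match survives: the second has the edge with a different label and the third has no such edge in G. Because the check touches only existing prefix matches, no tree matching has to be repeated for the longer sequence.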

Algorithm 8: SAPPER

Input: Database graph G, query graph q, threshold θ.
Output: List of all approximate matches of q.
begin
    Edge list EL ← e1, ..., el (∀ ei ∈ q), l ← |E(q)|.
    Sort q's edges decreasingly by their number of matches in G.
    Assign a lexicographical order to the edges in q using Algorithm (4).
    s ← e1, ..., e(l−θ), |s| = l − θ.
    End is a subgraph that includes the largest lexicographical order.
    while s ≠ End do
        if IsConnected(g(s)) then
            if the subgraph corresponding to the longest prefix of s is not matched yet then
                if s corresponds to one spanning tree then
                    Find g(s) matches using the Spanning Tree Matching Algorithm (3)
                    Find the best vertex search order.
                    Map the matches (Subgraph's Matches, Search Order), Algorithms (5) and (7)
                if s can have more than one spanning tree then
                    Find a random spanning tree of g(s), Algorithm (2)
                    Find g(s) matches using the Spanning Tree Matching Algorithm (3)
                    Find the best vertex search order.
                    Map the matches (Subgraph's Matches, Search Order), Algorithms (5) and (7)
            else
                // The longest prefix of the subgraph has been discovered
                Search for undiscovered edges in the prefix's matches, Algorithm (6)
            if g(s) has no match then
                s ← get a new sequence that does not contain g(s) as prefix.
            else
                s ← add one edge to g(s)
        else
            // g(s) is not connected
            continue
