Efficient Memoization Algorithms for Query Optimization: Top-Down Join Enumeration through Memoization on the Basis of Hypergraphs
©2014
Textbook
200 Pages
Summary
For a DBMS that provides support for a declarative query language like SQL, the query optimizer is a crucial piece of software. The declarative nature of a query allows it to be translated into many equivalent evaluation plans. The process of choosing a suitable plan from all alternatives is known as query optimization. The basis of this choice are a cost model and statistics over the data. Essential for the costs of a plan is the execution order of join operations in ist operator tree, since the runtime of plans with different join orders can vary by several orders of magnitude. An exhaustive search for an optimal solution over all possible operator trees is computationally infeasible. To decrease complexity, the search space must be restricted. Therefore, a well-accepted heuristic is applied: All possible bushy join trees are considered, while cross products are excluded from the search.<br>There are two efficient approaches to identify the best plan: bottom-up and top- down join enumeration. But only the top-down approach allows for branch-and-bound pruning, which can improve compile time by several orders of magnitude, while still preserving optimality.<br>Hence, this book focuses on the top-down join enumeration. In the first part, we present two efficient graph-partitioning algorithms suitable for top-down join enumer- ation. However, as we will see, there are two severe limitations: The proposed algo- rithms can handle only (1) simple (binary) join predicates and (2) inner joins. Therefore, the second part adopts one of the proposed partitioning strategies to overcome those limitations. Furthermore, we propose a more generic partitioning framework that enables every graph-partitioning algorithm to handle join predicates involving more than two relations, and outer joins as well as other non-inner joins. As we will see, our framework is more efficient than the adopted graph-partitioning algorithm. The third part of this book discusses the two branch-and-bound pruning strategies that can be found in the literature. We present seven advancements to the combined strategy that improve pruning (1) in terms of effectiveness, (2) in terms of robustness and (3), most importantly, avoid the worst-case behavior otherwise observed.<br>Different experiments evaluate the performance improvements of our proposed methods. We use the TPC-H, TPC-DS and SQLite test suite benchmarks to evalu- ate our joined contributions. As we show, the average compile time […]
Excerpt
Table Of Contents
Contents
1. Introduction
15
1.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15
1.2. Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17
1.2.1.
Graph-Aware Join Enumeration Algorithms . . . . . . . . . .
17
1.2.2.
Hypergraph-Aware Join Enumeration Algorithms . . . . . . .
17
1.2.3.
Branch-and-Bound Pruning . . . . . . . . . . . . . . . . . .
18
1.2.4.
Conclusion and Appendix . . . . . . . . . . . . . . . . . . .
18
2. Graph-Aware Join Enumeration Algorithms
19
2.1. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
2.1.1.
Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
2.1.2.
Query Graphs, Plan Classes and Costs . . . . . . . . . . . . .
22
2.1.3.
Problem Specification . . . . . . . . . . . . . . . . . . . . .
24
2.2. Basic Memoization . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
2.2.1.
Generic Top-Down Join Enumeration . . . . . . . . . . . . .
24
2.2.2.
Naive Partitioning
. . . . . . . . . . . . . . . . . . . . . . .
26
2.2.3.
Exemplified Execution of TDB
ASIC
. . . . . . . . . . . . . .
27
2.2.4.
Analysis of TDB
ASIC
. . . . . . . . . . . . . . . . . . . . .
28
2.3. Advanced Generate and Test . . . . . . . . . . . . . . . . . . . . . .
31
2.3.1.
TDM
IN
C
UT
AG
A
T . . . . . . . . . . . . . . . . . . . . . . .
31
2.3.2.
An Improved Connection Test . . . . . . . . . . . . . . . . .
33
2.3.3.
Correctness of Advanced Generate and Test . . . . . . . . . .
34
2.4. Conservative Graph Partitioning . . . . . . . . . . . . . . . . . . . .
39
2.4.1.
Correctness Constraints
. . . . . . . . . . . . . . . . . . . .
40
2.4.2.
The Algorithm in Detail . . . . . . . . . . . . . . . . . . . .
40
2.4.3.
Exploring Connected Components . . . . . . . . . . . . . . .
42
2.5. Branch Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . .
42
2.5.1.
Branch Partitioning - An Overview
. . . . . . . . . . . . . .
42
2.5.2.
The Algorithm in Detail . . . . . . . . . . . . . . . . . . . .
44
2.5.3.
Two Optimization Techniques . . . . . . . . . . . . . . . . .
46
2.5.4.
Exploring Restricted Neighbors . . . . . . . . . . . . . . . .
47
2.5.5.
Two Examples . . . . . . . . . . . . . . . . . . . . . . . . .
48
2.5.6.
Complexity of Branch Partitioning . . . . . . . . . . . . . . .
48
2.6. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
50
2.6.1.
Experimental Setup . . . . . . . . . . . . . . . . . . . . . . .
50
2.6.2.
Organizational Overview . . . . . . . . . . . . . . . . . . . .
50
2.6.3.
Experiments
. . . . . . . . . . . . . . . . . . . . . . . . . .
51
5
3. Hypergraph-Aware Join Enumeration Algorithms
63
3.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
63
3.2. Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
65
3.2.1.
Hypergraphs . . . . . . . . . . . . . . . . . . . . . . . . . .
65
3.2.2.
Graphs vs. Hypergraphs . . . . . . . . . . . . . . . . . . . .
68
3.2.3.
Neighborhood
. . . . . . . . . . . . . . . . . . . . . . . . .
69
3.3. Basic Memoization for Hypergraphs . . . . . . . . . . . . . . . . . .
70
3.3.1.
Generic Top-Down Join Enumeration for Hypergraphs . . . .
70
3.3.2.
Naive Partitioning of Hypergraphs . . . . . . . . . . . . . . .
71
3.3.3.
Test for Connectedness of Hypergraphs . . . . . . . . . . . .
71
3.4. Conservative Partitioning for Hypergraphs . . . . . . . . . . . . . . .
73
3.4.1.
Overview of M
IN
C
UT
C
ONSERVATIVE
H
YP
. . . . . . . . . .
74
3.4.2.
The Algorithm in Detail . . . . . . . . . . . . . . . . . . . .
76
3.4.3.
Avoiding Duplicates . . . . . . . . . . . . . . . . . . . . . .
77
3.4.4.
GET
C
ONNECTED
C
OMPONENTS
. . . . . . . . . . . . . . . .
82
3.4.5.
MAINTAIN
C
SET
. . . . . . . . . . . . . . . . . . . . . . . .
83
3.4.6.
An Example
. . . . . . . . . . . . . . . . . . . . . . . . . .
84
3.5. Generic Top-Down Join Enumeration for Hypergraphs . . . . . . . .
85
3.5.1.
High-Level Overview . . . . . . . . . . . . . . . . . . . . . .
85
3.5.2.
Structure of the Generic Partitioning Framework . . . . . . .
89
3.5.3.
Generating the Adjacency Information . . . . . . . . . . . . .
91
3.5.4.
Composing Compound Vertices . . . . . . . . . . . . . . . .
94
3.5.5.
Economizing on Connection Tests . . . . . . . . . . . . . . . 104
3.5.6.
Efficient Subgraph Handling . . . . . . . . . . . . . . . . . . 108
3.5.7.
Implementation Details . . . . . . . . . . . . . . . . . . . . . 120
3.6. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.6.1.
Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.6.2.
Workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
3.6.3.
Organizational Overview . . . . . . . . . . . . . . . . . . . . 123
3.6.4.
Evaluation of Random Acyclic Query Graphs . . . . . . . . . 124
3.6.5.
Evaluation of Random Cyclic Query Graphs
. . . . . . . . . 128
3.6.6.
Overhead Detection
. . . . . . . . . . . . . . . . . . . . . . 129
3.6.7.
Performance Evaluation with Different Benchmarks
. . . . . 129
4. Branch-and-Bound Pruning
137
4.1. Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.2. Accumulated-Cost Bounding and Predicted-Cost Bounding . . . . . . 138
4.2.1.
Building a Join Tree . . . . . . . . . . . . . . . . . . . . . . 138
4.2.2.
Accumulated-Cost Bounding . . . . . . . . . . . . . . . . . . 138
4.2.3.
Predicted-Cost Bounding . . . . . . . . . . . . . . . . . . . . 139
4.2.4.
Combining the Methods . . . . . . . . . . . . . . . . . . . . 140
4.2.5.
An Example for Accumulated-Predicted-Cost Bounding . . . 140
4.3. Technical Advances . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.4. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.4.1.
Implementation . . . . . . . . . . . . . . . . . . . . . . . . . 144
4.4.2.
Workload . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
4.4.3.
Performance Evaluation with Simple Query Graphs . . . . . . 146
6
4.4.4.
Performance Evaluation with Complex Query Graphs . . . . . 156
4.4.5.
Performance Evaluation with Different Benchmarks
. . . . . 160
5. Conclusion
169
5.1. Graph-Aware Join Enumeration Algorithms . . . . . . . . . . . . . . 170
5.2. Hypergraph-Aware Join Enumeration Algorithms . . . . . . . . . . . 170
5.3. Branch and Bound Pruning . . . . . . . . . . . . . . . . . . . . . . . 170
5.4. Graceful Degradation . . . . . . . . . . . . . . . . . . . . . . . . . . 171
5.5. Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
A. TDMinCutLazy
173
A.1. Important Notions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
A.2. Lazy Minimal Cut Partitioning . . . . . . . . . . . . . . . . . . . . . 173
A.3. Biconnection Tree Building . . . . . . . . . . . . . . . . . . . . . . . 174
A.3.1. Biconnection Tree Construction . . . . . . . . . . . . . . . . 174
A.3.2. Complexity of Biconnection Tree Construction . . . . . . . . 176
A.3.3. Computation of Ancestors and Descendants . . . . . . . . . . 176
A.3.4. An Alternative to Tree Construction . . . . . . . . . . . . . . 178
A.4. Complexity of Lazy Minimal Cut Partitioning . . . . . . . . . . . . . 178
A.5. Improved Version . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
A.5.1. Global Reuse of the Biconnection Tree . . . . . . . . . . . . 180
A.5.2. Analyzing the Number of Tree Buildings . . . . . . . . . . . 180
B. Iterator Implementations
183
B.1. Conservative Partitioning . . . . . . . . . . . . . . . . . . . . . . . . 183
B.2. Branch Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
7
List of Figures
2.1. Example graph with eleven relations. . . . . . . . . . . . . . . . . . .
22
2.2. Pseudocode for TDP
LAN
G
EN
. . . . . . . . . . . . . . . . . . . . .
25
2.3. Pseudocode for
BUILD
T
REE
. . . . . . . . . . . . . . . . . . . . . .
26
2.4. Pseudocode for naive partitioning
. . . . . . . . . . . . . . . . . . .
26
2.5. Pseudocode for
IS
C
ONNECTED
. . . . . . . . . . . . . . . . . . . . .
27
2.6. Example graph with four relations . . . . . . . . . . . . . . . . . . .
28
2.7. Selectivity and Cardinalities for the Query Graph of Figure 2.6 . . . .
28
2.8. Pseudocode for M
IN
C
UT
AG
A
T
. . . . . . . . . . . . . . . . . . . .
33
2.9. Pseudocode for
IS
C
ONNECTED
I
MP
. . . . . . . . . . . . . . . . . .
33
2.10. Pseudocode for M
IN
C
UT
C
ONSERVATIVE
. . . . . . . . . . . . . . .
41
2.11. Pseudocode for
GET
C
ONNECTED
C
OMPONENTS
. . . . . . . . . . .
43
2.12. Pseudocode for P
ARTITION
M inCutBranch
. . . . . . . . . . . . . . .
46
2.13. Pseudocode for M
IN
C
UT
B
RANCH
. . . . . . . . . . . . . . . . . . .
47
2.14. Pseudocode for
REACHABLE
. . . . . . . . . . . . . . . . . . . . . .
48
2.15. Chain Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
48
2.16. Cyclic Query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
48
2.17. Cost per emitted ccp
. . . . . . . . . . . . . . . . . . . . . . . . . .
52
2.18. Absolute and normed runtime results for chain queries. . . . . . . . .
54
2.19. Absolute and normed runtime results for star queries. . . . . . . . . .
55
2.20. Absolute and normed runtime results for random acyclic queries . . .
57
2.21. Absolute and normed runtime results for cycle queries. . . . . . . . .
58
2.22. Absolute and normed runtime results for clique queries. . . . . . . . .
59
2.23. Absolute and normed runtime results for cyclic queries (8 vertices) . .
60
2.24. Absolute and normed runtime results for cyclic queries (16 vertices) .
61
3.1. Hypergraph H
(V, E) with five relations . . . . . . . . . . . . . . . .
65
3.2. A bitvector with
2
x
bits . . . . . . . . . . . . . . . . . . . . . . . . .
68
3.3. Pseudocode for TDP
LAN
G
EN
H
YP
. . . . . . . . . . . . . . . . . . .
71
3.4. Pseudocode for
BUILD
T
REE
. . . . . . . . . . . . . . . . . . . . . .
71
3.5. Pseudocode for naive partitioning for hypergraphs . . . . . . . . . . .
72
3.6. Pseudocode for
IS
C
ONNECTED
H
YP
. . . . . . . . . . . . . . . . . .
73
3.7. Pseudocode for
GET
S
IMPLE
C
OMPONENTS
. . . . . . . . . . . . . .
73
3.8. Pseudocode for
MERGE
C
OMPONENTS
. . . . . . . . . . . . . . . . .
74
3.9. Hypergraph with five relation and a disconnected hypernode. . . . . .
77
3.10. Pseudocode for M
IN
C
UT
C
ONSERVATIVE
H
YP
. . . . . . . . . . . . .
78
3.11. Sample Hypergraph . . . . . . . . . . . . . . . . . . . . . . . . . . .
78
3.12. Pseudocode for
MAINTAIN
X
MAP
. . . . . . . . . . . . . . . . . . .
80
3.13. Pseudocode for
CHECK
X
MAP
. . . . . . . . . . . . . . . . . . . . .
80
3.14. Hypergraph H
(V, E) with five relations . . . . . . . . . . . . . . . .
81
9
3.15. Pseudocode for
MAINTAIN
X
MAP
N oF N
. . . . . . . . . . . . . . . .
83
3.16. Pseudocode for
CHECK
X
MAP
N oF N
. . . . . . . . . . . . . . . . . .
83
3.17. Pseudocode for
GET
C
ONNECTED
S
ETS
. . . . . . . . . . . . . . . .
84
3.18. Pseudocode for
MAINTAIN
C
SET
. . . . . . . . . . . . . . . . . . . .
84
3.19. (a) Overlapping hyperedges, (b) and (c) simple graphs . . . . . . . . .
87
3.20. (a) Hypergraph, (b) simple graph and (c) final graph . . . . . . . . . .
88
3.21. Pseudocode for
PARTITION
X
. . . . . . . . . . . . . . . . . . . . . .
91
3.22. Call graph for the top-level case of
PARTITION
X
. . . . . . . . . . . .
92
3.23. Struct StackEntry . . . . . . . . . . . . . . . . . . . . . . . . . . .
95
3.24. Struct HyperEdge . . . . . . . . . . . . . . . . . . . . . . . . . . .
95
3.25. Global Variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
95
3.26. Pseudocode for
INITIALIZE
I
NFO
S
TACK
. . . . . . . . . . . . . . . .
95
3.27. Pseudocode for
COMPUTE
L
OOK
U
P
I
DX
. . . . . . . . . . . . . . . .
95
3.28. Struct Overlap used by
COMPUTE
A
DJACENCY
I
NFO
. . . . . . . . .
96
3.29. Pseudocode for
COMPUTE
A
DJACENCY
I
NFO
. . . . . . . . . . . . .
97
3.30. Pseudocode for
STORE
C
OMPLEX
H
YPER
E
DGE
. . . . . . . . . . . .
98
3.31. Pseudocode for
DE
R
EFERENCE
. . . . . . . . . . . . . . . . . . . .
98
3.32. Pseudocode for
STORE
A
DJACENCY
I
NFO
. . . . . . . . . . . . . . .
98
3.33. Pseudocode for
COMPOSE
C
OMPOUND
V
ERTICES
. . . . . . . . . . .
99
3.34. Pseudocode for
MANAGE
A
DJACENCY
I
NFO
. . . . . . . . . . . . . . 100
3.35. Pseudocode for
GET
BCCI
NFO
. . . . . . . . . . . . . . . . . . . . . 102
3.36. Pseudocode for
FIND
I
NITIAL
C
OMPOUNDS
. . . . . . . . . . . . . . 103
3.37. Pseudocode for
MAINTAIN
L
ABELS
. . . . . . . . . . . . . . . . . . 103
3.38. Pseudocode for
DECODE
. . . . . . . . . . . . . . . . . . . . . . . . 104
3.39. Pseudocode for
CONNECTION
T
EST
R
EQUIRED
. . . . . . . . . . . . 105
3.40. (a) Hypergraph, (b) simple graph and (c) final graph . . . . . . . . . . 106
3.41. Pseudocode for
MAXIMIZE
C
OMPOUND
V
ERTICES
. . . . . . . . . . 107
3.42. Call graph for the top-level case of
PARTITION
X
. . . . . . . . . . . . 109
3.43. Pseudocode for
MANAGE
I
NFO
S
TACK
. . . . . . . . . . . . . . . . . 110
3.44. Pseudocode for
COMPUTE
F
ILTERS
. . . . . . . . . . . . . . . . . . . 110
3.45. Pseudocode for
CLEANSE
H
YPER
N
EIGHBOURS
. . . . . . . . . . . . 112
3.46. (a) Hypergraph, (b) simplified graph and (c) final graph . . . . . . . . 114
3.47. (a) Sub-hypergraph of Figure 3.46 (b) a disconnected graph . . . . . . 114
3.48. (a) Hypergraph, (b) simplified graph and (c) final graph . . . . . . . . 116
3.49. (a) Sub-hypergraph of Figure 3.48 (b) simple graph . . . . . . . . . . 116
3.50. Pseudocode for
RE
C
OMPOSE
C
OMPOUND
V
ERTICES
. . . . . . . . . 118
3.51. Pseudocode for
ADJUST
C
OMPOUND
F
ILTER
. . . . . . . . . . . . . . 119
3.52. Pseudocode for
RE
M
ANAGE
A
DJACENCY
I
NFO
. . . . . . . . . . . . 120
3.53. Acyclic/inner/complex . . . . . . . . . . . . . . . . . . . . . . . . . 126
3.54. Acyclic/non-inner/simple . . . . . . . . . . . . . . . . . . . . . . . . 127
3.55. Cyclic/inner/complex with 10 relations . . . . . . . . . . . . . . . . . 130
3.56. Cyclic/inner/complex with 15 relations . . . . . . . . . . . . . . . . . 131
3.57. Cyclic/non-inner/simple with 10 relations . . . . . . . . . . . . . . . 132
3.58. Cyclic/non-inner/simple with 15 relations . . . . . . . . . . . . . . . 133
4.1. Pseudocode for
BUILD
T
REE
. . . . . . . . . . . . . . . . . . . . . . 138
4.2. Pseudocode for TDPG
ACB
. . . . . . . . . . . . . . . . . . . . . . . 139
10
4.3. Pseudocode for TDPG
P CB
. . . . . . . . . . . . . . . . . . . . . . . 140
4.4. Pseudocode for TDPG
ACBI
. . . . . . . . . . . . . . . . . . . . . . 142
4.5. Relation and domain sizes for random join queries . . . . . . . . . . . 145
4.6. Performance results for chain queries. . . . . . . . . . . . . . . . . . 151
4.7. Performance results for star queries. . . . . . . . . . . . . . . . . . . 152
4.8. Performance results for random acyclic queries . . . . . . . . . . . . 153
4.9. Density plot of random acyclic queries.
. . . . . . . . . . . . . . . . 154
4.10. Performance results for cycle queries. . . . . . . . . . . . . . . . . . 155
4.11. Performance results for clique queries. . . . . . . . . . . . . . . . . . 156
4.12. Performance results for random cyclic queries with 10 vertices. . . . . 157
4.13. Performance results for random cyclic queries with 15 vertices. . . . . 158
4.14. Density plot of random cyclic queries. . . . . . . . . . . . . . . . . . 159
4.15. Performance of different Pruning Advancements. . . . . . . . . . . . 160
4.16. Acyclic/inner/complex . . . . . . . . . . . . . . . . . . . . . . . . . 161
4.17. Acyclic/non-inner/simple . . . . . . . . . . . . . . . . . . . . . . . . 162
4.18. Cyclic/inner/complex with 10 relations . . . . . . . . . . . . . . . . . 163
4.19. Cyclic/inner/complex with 15 relations . . . . . . . . . . . . . . . . . 164
4.20. Cyclic/non-inner/simple with 10 relations . . . . . . . . . . . . . . . 165
4.21. Cyclic/non-inner/simple with 15 relations . . . . . . . . . . . . . . . 166
4.22. Density plots for TDM
C
BH
YP
AP CBI
with TDM
C
BH
YP
as norm . . 167
4.23. Density plots for TDM
C
BH
YP
AP CBI
with DPH
YP
as norm . . . . . 167
A.1. Pseudocode for M
IN
C
UT
L
AZY
. . . . . . . . . . . . . . . . . . . . . 174
A.2. Pseudocode for
BUILD
BCT
. . . . . . . . . . . . . . . . . . . . . . 176
A.3. Pseudocode for
BUILD
BCTS
UB
. . . . . . . . . . . . . . . . . . . . 177
11
List of Tables
2.1. Exemplified execution of TDB
ASIC
for the input graph of Figure 2.6 .
29
2.2. Multiple calls to B
UILD
T
REE
during the enumeration process
. . . .
29
2.3. All connected (sub) sets S and the corresponding P
sym
ccp
(S) . . . . . .
30
2.4. Sample values for inner counter and number of calls to
BUILD
P
LAN
.
30
2.5. Exemplified execution of M
IN
C
UT
B
RANCH
for Figure 2.15 . . . . .
49
2.6. Exemplified execution of M
IN
C
UT
B
RANCH
for Figure 2.16 . . . . .
49
2.7. Names and abbreviated names of different plan generators . . . . . .
51
2.8. Minimum, maximum and average of the normalized runtimes . . . . .
56
2.9. Minimum, maximum and average of the normalized runtimes . . . . .
56
3.1. M
IN
C
UT
C
ONSERVATIVE
H
YP
emits a duplicate ccp . . . . . . . . . .
82
3.2. Exemplified execution of M
IN
C
UT
C
ONSERVATIVE
H
YP
. . . . . . .
86
3.3. Names of different plan generation algorithms . . . . . . . . . . . . . 124
3.4. Performance results for random queries . . . . . . . . . . . . . . . . 125
3.5. Normed runtimes for Chain, Star, Cycle and Clique queries . . . . . . 129
3.6. Normed runtimes for some TPC-H Queries . . . . . . . . . . . . . . 134
3.7. Runtimes for TPC-H Queries . . . . . . . . . . . . . . . . . . . . . . 135
3.8. Runtimes for TPC-DS Queries . . . . . . . . . . . . . . . . . . . . . 135
3.9. Runtimes for SQLite test suite queries . . . . . . . . . . . . . . . . . 135
4.1. Exemplified execution of TDPG
AP CB
. . . . . . . . . . . . . . . . . 141
4.2. Abbreviated names of different algorithms . . . . . . . . . . . . . . . 146
4.3. Normed runtimes for chain and star queries . . . . . . . . . . . . . . 148
4.4. Normed runtimes for random acyclic and random cyclic queries . . . 148
4.5. Normed runtimes for cycle and clique queries . . . . . . . . . . . . . 149
4.6. Number of optimal join trees built and failed join tree requests . . . . 150
4.7. Abbreviated names of different algorithms . . . . . . . . . . . . . . . 157
4.8. Performance results for random queries . . . . . . . . . . . . . . . . 160
4.9. Normed runtimes for some TPC-H Queries . . . . . . . . . . . . . . 162
4.10. Runtimes for TPC-H Queries . . . . . . . . . . . . . . . . . . . . . . 163
4.11. Runtimes for TPC-DS Queries . . . . . . . . . . . . . . . . . . . . . 165
4.12. Runtimes for SQLite test suite queries . . . . . . . . . . . . . . . . . 166
A.1. Number of biconnection tree buildings . . . . . . . . . . . . . . . . . 182
13
1. Introduction
1.1. Motivation
Queries against DBMSs are often formulated in declarative languages. Prominent
examples are SQL, OQL, XPath and XQuery. Writing such a declarative query has
two advantages: (1) The querist does not need to decide upon the actual algorithms and
execution order to access and combine the data, which in turn (2) leaves the DBMS
with several degrees of freedom to choose the best evaluation and execution strategy
in order to answer the query. This is a shift of complexity: from formulating the
query towards how to answer it in a most efficient way. We refer to the process of
transforming the declarative query in an imperative execution plan as plan generation,
and we call the component in the DBMS which deals with the complexity of choosing
a suitable plan from all alternatives the plan generator.
Today's plan generators are cost-based. This means that they rely on a cost model
and statistics over the data in order to select from all equivalent plans the one with
the lowest costs. Essential for the costs of a plan is the execution order of join opera-
tions in its operator tree, since the runtime of plans with different join orders can vary
by several orders of magnitude. An exhaustive search for an optimal solution over
all possible operator trees is computationally infeasible. To decrease complexity, the
search space must be restricted. For the optimization problem discussed in this book,
the well-accepted connectivity heuristic is applied: We consider all possible bushy join
trees, but exclude cross products from the search, presuming that all considered queries
span a connected query graph. Thereby, a query graph is an undirected graph where
join predicates span the edges between the relations referenced in the SQL query, i.e.,
a graph edge between R
1
and R
2
is introduced if there exists a join predicate involving
attributes of R
1
and R
2
.
When designing a plan generator, there are two strategies to find an optimal join
order: bottom-up join enumeration via dynamic programming, and top-down join enu-
meration through memoization.
Both plan generation approaches rely on Bellman's Principle of Optimality
1
: They
generate an optimal join tree for a set of relations S by considering optimal subjoin
trees only. This means that non-optimal, i.e., more expensive, subjoin trees can be dis-
carded, which curtails the search space enormously
2
. Moreover, since the connectivity
heuristic is applied
3
, the optimal (sub) join tree needs to be constructed only for those
1
The presence of properties requires additional care. For reasons of simplicity properties are ignored
here.
2
The search space is reduced from
|V |! C(|V | - 1) number of plans to
3
|V |
-2
|V |+1
+1
2
where
|V | is the
number of relations referenced in the query and
C are the Catalan Numbers with C(n) =
1
n+1
2n
n
[4].
|V |! C(|V | - 1) can be simplified to
(2|V |-2)!
(|V |-1)!
[13, 19, 31]
3
Which can reduce the search space further depending on the query graph down to
|V |
3
-|V |
3
number of
plans.
15
subsets of relations S that can be joined without the need of applying cross products.
In other words, the subset S of relations referenced in the SQL query has to induce a
connected subgraph of the original query graph.
In order to determine the best join tree for a given subset S of relations, the top-
down/bottom-up plan generator must enumerate all partitions
(S
1
, S
2
) of S such that
S
= S
1
S
2
and S
1
S
2
= holds. Furthermore, since we exclude cross products,
S
1
and S
2
must induce connected subgraphs, and there must be two relations R
1
S
1
and R
2
S
2
such that they are connected by a graph edge. Let us call such a partition
(S
1
, S
2
) a ccp. Denote by T
i
the optimal plan for S
i
. Then the query optimizer has to
consider the plans T
1
I T
2
for all ccps
(S
1
, S
2
) in order to compute the optimal join
tree for the relations contained in S.
Thus, both the bottom-up and the top-down join enumeration face the same chal-
lenge: to efficiently compute the ccps. There has been an ongoing race between top-
down and bottom-up join enumeration concerning this challenge. Traditionally, all
partitioning strategies have been generate-and-test based. But depending on the shape
of the query graph, most of the generated partitions are not valid ccps, i.e., are filtered
out by the tests for connectivity. That is why those approaches are suboptimal and can
have an exponential overhead
4
.
In bottom-up join enumeration, all the connected subsets for a given set are al-
ready generated. Therefore, a partitioning strategy for dynamic programming that is
not generate-and-test based should be easier to design. Moerkotte and Neumann [22]
presented a dynamic programming variant called DP
CCP
, producing all partitions in
constant time O
(1) per valid ccp.
For top-down join enumeration via memoization, no such equally efficient solution
is known yet. Finding an analogous variant to DP
CCP
for memoization is very ap-
pealing, not only for the outcome of the race but also because the nature of top-down
processing can leverage the benefits of branch-and-bound pruning. The beauty of those
pruning strategies is that they are risk-free: They can speed up processing by several
orders of magnitude, while at the same time they preserve optimality of the final join
tree.
DeHaan and Tompa took up the challenge and proposed with M
IN
C
UT
L
AZY
[5]
a minimal graph cut partitioning algorithm for memoization. Nevertheless, TDM
C
L,
which is the generic top-down join enumeration algorithm instantiated with M
IN
C
UT
-
L
AZY
, cannot compete with DP
CCP
. The first contribution of this work are two par-
titioning algorithms for top-down join enumeration that close the performance gap to
DP
CCP
.
However, the proposed algorithms DP
CCP
and TDM
C
L are not ready to be used
in real-world scenarios yet because there exist severe limitations: First, as has been
argued in several places, hypergraphs must be handled by any plan generator [2, 27,
35]. Second, plan generators have to deal with outer joins and antijoins [14, 27].
In general, these operators are not freely reorderable, i.e., there might exist different
orderings, which produce different results. Because it has been shown that the non-
inner join reordering problem can be reduced to hypergraphs, it remains the top goal of
4
In case of a chain query for example, the naive generate-and-test based approach for top-down join
enumeration generates
2
|V |+2
- |V |
2
- 3 |V | - 4 partitions but only
|V |
3
-|V |
3
are valid
ccps [26].
16
any plan generator to deal with hypergraphs [2, 21, 27, 20]. Consequently, Moerkotte
and Neumann [21] extended DP
CCP
to DPH
YP
to handle hypergraphs.
The second main contribution of this work is a generic partitioning framework that
transforms hypergraphs to restrictive graphs and applies some further modifications.
The advantage of this approach is that existing, well performing graph-partitioning
algorithms can be reused in order to efficiently handle hypergraphs.
As the third and last contribution of this book, we present advancements to the
known branch-and-bound pruning strategies.
As will be shown, all combined contributions of this book result in a performance
advantage of
100% over DPH
YP
by considering the TPC-H [34], TPC-DS [33] and
the SQLite test suite [29] benchmarks. For syntactic workloads we present average
runtime improvements by orders of magnitude.
The detailed contributions together with the outline of this book are described in the
following section.
1.2. Contribution
1.2.1. Graph-Aware Join Enumeration Algorithms
In Chapter 2, we give a general introduction to top-down join enumeration. We ex-
plain a naive approach and give a complexity analysis that motivates three new graph-
partitioning strategies. For the last partitioning algorithm, we show that it has a com-
plexity in O
(1) per emitted ccp for acyclic and standard query graphs. A performance
evaluation concludes this chapter, showing that two of the three partitioning algorithms
are competitive with DP
CCP
. The following publications contributed to this chapter:
[9]
Pit Fender and Guido Moerkotte. Reassessing top-down join enumeration.
IEEE Transactions on Knowledge and Data Engineering, 24(10):18031818,
2012
[12]
Pit Fender, Guido Moerkotte, Thomas Neumann, and Viktor Leis. Effective
and robust pruning for top-down join enumeration algorithms. In Proceedings
of the 28th International Conference on Data Engineering, pages 414425,
2012
[8]
Pit Fender and Guido Moerkotte. A new, highly efficient, and easy to im-
plement top-down join enumeration algorithm. In Proceedings of the 27th
International Conference on Data Engineering, pages 864875, 2011
1.2.2. Hypergraph-Aware Join Enumeration Algorithms
We start Chapter 3 by motivating why the handling of hypergraphs is indispensable.
After that, we adjust the naive top-down join enumeration algorithm of Chapter 2 and
explain the necessary changes. We continue with a description of the first hypergraph-
aware partitioning algorithm. Then we present our main contribution: a generic parti-
tioning framework that enables graph-aware partitioning algorithms to cope with hy-
pergraphs. We show how the partitioning strategies of Chapter 2 can be reused. Then
we conclude with a performance evaluation that includes the runtime results of the
17
TPC-H, TPC-DS and the SQLite test suite benchmarks. The hypergraph-aware parti-
tioning algorithm and the generic framework have already been published:
[11]
Pit Fender and Guido Moerkotte. Top-down plan generation: From theory
to practice. In Proceedings of the 29th International Conference on Data
Engineering, pages 11051116, 2013
[10]
Pit Fender and Guido Moerkotte. Counter strike: Generic top-down join enu-
meration for hypergraphs. Proceedings of the VLDB Endowment, 6(14):1822
1833, September 2013
1.2.3. Branch-and-Bound Pruning
The main advantage of top-down join enumeration over bottom-up join enumeration
is that it allows for branch-and-bound pruning. Chapter 4 starts with an introduction
to branch-and-bound pruning. Then, we follow with seven advancements that improve
pruning (1) in terms of effectiveness, (2) in terms of robustness and (3) by avoiding
its potential worst case behavior otherwise observed. At the end of Chapter 4, we
give an in-depth performance evaluation. Furthermore, we give results for the TPC-H,
TPC-DS and the SQLite test suite benchmarks. We have published advancements and
results as follows:
[12]
Pit Fender, Guido Moerkotte, Thomas Neumann, and Viktor Leis. Effective
and robust pruning for top-down join enumeration algorithms. In Proceedings
of the 28th International Conference on Data Engineering, pages 414425,
2012
1.2.4. Conclusion and Appendix
We conclude this book in Chapter 5. Appendix A gives a complexity analysis of
the work of DeHaan and Tompa [5]. Furthermore, we include the C++ Code of two
partitioning algorithms in Appendix B.1 and B.2.
18
2. Graph-Aware Join Enumeration
Algorithms
For a precise description of our graph-aware join enumeration algorithms, we have to
give some fundamentals. We start with the preliminaries in Section 2.1. An intro-
duction to top-down join enumeration by explaining a naive memoization algorithm
follows in Section 2.2. After that, Section 2.3 presents our first graph-aware top-down
join enumeration algorithm based on an advanced-generate-and-test paradigm. We
improve upon that approach by presenting conservative graph partitioning in Section
2.4. Finally, Section 2.5 explains branch partitioning which has a complexity that is in
O
(1) for standard queries per emitted ccp. We conclude this chapter with an extensive
performance evaluation in Section 2.6.
2.1. Preliminaries
2.1.1. Graphs
We start with the notion of a graph as the basis for the definitions that follow.
Definition 2.1.1. An undirected graph G
= (V, E) is defined by a set of vertices V
and a set of edges E. The set of edges E is a set of unordered pairs
(v, w) for which
v, w
V with v = w holds. We assume that the nodes in V are totally ordered via an
(arbitrary) relation
.
For an edge
(v, w), v and w are said to be adjacent to each other, and the edge is
said to be incident with v and w. The next definition specifies what we mean by an
index of a vertex.
Definition 2.1.2. Let G
= (V, E) be an undirected graph with v
i
V . Further let
be a binary relation specifying the total order of the vertices of V . We say i is the
index of v
i
with i
= |{w v
i
|w V }|.
We continue with the notion of node-induced subgraphs.
Definition 2.1.3. Let G
= (V, E) be an undirected graph and V V a subset of
nodes. The node-induced subgraph G
|
V
of G is defined as G
|
V
= (V , E ) with
E
= {(v, w) | (v, w) E v V w V }. The node ordering on V is the
restriction of the node ordering of V .
Having defined a graph, we now specify what we mean by a path between vertices
v
0
and v
l
in G
= (V, E).
Definition 2.1.4. Let G
= (V, E) be an undirected graph, then a path u
w with the
length l between vertices u and w is defined as a sequence of vertices v
0
, v
1
, v
2
, ..., v
l
19
in V such that u
= v
0
and w
= v
l
and
(v
i
-1
, v
i
) E for i = 1, 2, ...l. The length l of
the path is the number of edges
(v
0
, v
1
), (v
1
, v
2
), ..., (v
l
-1
, v
l
) in the path.
With the definition of a path, we are able to give the notion of a cycle.
Definition 2.1.5. Let G
= (V, E) be an undirected graph, then a cycle is a path
v
0
, v
1
, v
2
, ..., v
l
with
0il
v
i
V where v
0
= v
l
holds.
We make the following observation:
Observation 2.1.6. Let G
= (V, E) be an undirected graph, then a path
v
0
, v
1
, v
2
, ..., v
l
is free of cycles if
0i<jl
v
i
= v
j
holds.
Through the definition of a path, we can express the notion of a connected
(sub)graph.
Definition 2.1.7. Let G
= (V, E) be an undirected graph. If there exists a path be-
tween each pair of vertices of V , then G is called connected.
There is an alternative way to specify whether a graph is connected.
Observation 2.1.8. Let G
= (V, E) be an undirected graph. G is connected if |V | = 1
or if there exists a partitioning V , V
of V and an edge
(v, w) E such that v V ,
w
V , and both G|
V
and G
|
V
are connected.
For this, we need the notion of the direct and indirect neighborhood.
Definition 2.1.9. Let G
= (V, E) be an undirected graph, then the neighborhood of a
vertex v
V is defined as
N (v) = {w | w V (v, w) E}.
We define the neighborhood of a set:
Definition 2.1.10. Let G
= (V, E) be an undirected graph, then the neighborhood of
a set S
V is defined as:
N (S) = {w | v S w (V \ S) (v, w) E}.
And a set's indirect neighborhood is defined by:
Definition 2.1.11. Let G
= (V, E) be an undirected graph, then the indirect neigh-
borhood of a set S
V is defined as:
N
0
(S) = S
N
1
(S) = N (S)
N
i
+1
(S) = N (N
i
(S)) \ (
j
=0...i
N
j
(S)).
In the following observation, we give another way of testing a graph G for connec-
tivity:
20
Observation 2.1.12. An undirected graph G
= (V, E) is connected only if for an
arbitrary vertex v
V , the set of all generations of neighbors of {v} equals V :
V
=
0i<|V |
N
i
({v}) .
We give the notion of a partition of the vertex set V .
Definition 2.1.13. Let G
= (V, E) be a connected undirected graph, V
1
V with
V
1
= and V
2
= V \ V
1
, then
(V
1
, V
2
) is called a partition of V .
Fundamental for the algorithms described in this book is the notion of a graph cut.
Definition 2.1.14. Let G
= (V, E) be a connected undirected graph and (V
1
, V
2
) a
partition of V . The set of all edges E
cut
= {(v
1
, v
2
) E | v
1
V
1
v
2
V
2
}
crossing this partition is called a graph cut.
It follows from the definition that a graph cut necessarily disconnects G into at least
two connected subgraphs. A special case of a graph cut is a minimal cut.
Definition 2.1.15. Let G
= (V, E) is a connected undirected graph and (V
1
, V
2
) a
partition of V . If both V
1
and V
2
induce connected subgraphs, then E
cut
= {(v
1
, v
2
)
E
| v
1
V
1
v
2
V
2
} is a minimal graph cut.
Let G
= (V, E) be a connected undirected graph and (V
1
, V
2
) a partition of V . The
set of edges produced by the graph cut to gain
(V
1
, V
2
) is specified with E
cut
(V
1
,V
2
)
=
{(v
1
, v
2
) E | v
1
V
1
v
2
V
2
}. Assume that G
|V
1
is not connected. Further,
let V
1
1
be a proper non-empty subset of V
1
. We can find a minimal cut producing an
edge set E
cut
(V
11
,V
\V
11
)
= {(v
1
, v
2
) E | v
1
V
1
1
v
2
V \ V
1
1
} that partitions
V
into
(V
1
1
, V
\ V
1
1
), and it holds that E
cut
(V
11
,V
\V
11
)
E
cut
(V
1
,V
2
)
and V
1
1
V
1
.
But if both V
1
and V
2
induce a connected graph, then the graph cut E
cut
(V
1
,V
2
)
cannot
contain another graph cut that is a proper subset. The following observation explains
why it is called minimal cut.
Observation 2.1.16. A minimal cut contains no other graph cuts as a proper subset.
There is a special type of vertex, called an articulation vertex of a connected graph
G
. Removing such a vertex from V disconnects the graph into at least two connected
subgraphs. We give its definition.
Definition 2.1.17. Let G
= (V, E) be an undirected connected graph. A vertex a V
is called articulation vertex if there exist two vertices v
V and w V , such that
every path v
w in V must contain a.
The articulation vertices of a connected graph G
= (V, E) are important when
determining the biconnected components of a graph.
Definition 2.1.18. Let G
= (V, E) be a connected undirected graph. A biconnected
component is a connected subgraph G
BCC
i
=(V
i
, E
i
) of G with V
i
= {v | (v = u
v
= w) (v, w) E
i
}, where the set of edges E
i
E is maximal such that any
two distinct edges
(u, w) E
i
and
(x, y) E
i
lie on a cycle v
0
, v
1
, v
2
, ..., v
l
, where
u
= v
0
u = v
l
w = v
1
x = v
j
-1
y = v
j
0 < j < l and
0i<j<l
v
i
= v
j
holds. If for an edge
(u, w) E
i
no such cycle exists, the vertices u, w
V
i
induce a
biconnected component G
BCC
i
= ({u, w}, {(u, w)}).
21
R
1
R
2
R
5
R
0
R
9
R
10
R
3
R
4
R
8
R
7
R
6
Figure 2.1.: Graph
of
eleven
relations,
five
biconnected
components
G
BCC
|{R
0
,R
1
,R
2
,R
3
,R
7
,R
8
}
, G
BCC
|{R
3
,R
4
}
, G
BCC
|{R
4
,R
5
,R
6
}
, G
BCC
|{R
8
,R
9
}
and G
BCC
|{R
9
,R
1
0}
and four articulation vertices R
3
, R
4
, R
8
, R
9
We make the following observation:
Observation 2.1.19. Let G
= (V, E) be a connected undirected graph with k bi-
connected components. Then the edge set E can be divided into equivalence classes
E
1
, E
2
, ...E
k
such that each subset E
i
spans one of the k biconnected components
G
BCC
i
= (V
i
, E
i
) of G.
Furthermore, we observe:
Observation 2.1.20. Two biconnected components G
i
and G
j
cannot have more than
one vertex in common. If they have one vertex a in common, then a has to be an
articulation vertex, and V
i
V
j
= {a} holds.
We exemplify the notion of biconnected components and articulation vertices in
Figure 2.1. The connected undirected graph G
= (V, E) with |V | = 11 has five
biconnected components which are: G
BCC
|{R
0
,R
1
,R
2
,R
3
,R
7
,R
8
}
, G
BCC
|{R
3
,R
4
}
, G
BCC
|{R
4
,R
5
,R
6
}
,
G
BCC
|{R
8
,R
9
}
and G
BCC
|{R
9
,R
10
}
. There are four articulation vertices: R
3
, R
4
, R
8
, R
9
. We
can divide the edge set E in five equivalence classes, which are: E
1
= {(R
0
, R
1
), (R
1
, R
2
), (R
2
, R
3
), (R
3
, R
7
), (R
7
, R
8
), (R
0
, R
8
)}, E
2
= {(R
3
, R
4
)}, E
3
= {(R
4
, R
5
),
(R
5
, R
6
), (R
4
, R
6
)}, E
4
= {(R
8
, R
9
)} and E
5
= {(R
9
, R
10
)}.
2.1.2. Query Graphs, Plan Classes and Costs
In this subsection, we introduce the notion of a query graph which serves as an input
for our plan generation algorithms. After that, we specify the notion of a plan class.
Finally, we define the C
out
cost function.
The notion of query graph is synonymously used for graphs and hypergraphs. For
the definition of a hypergraph, we refer to Section 3.2.1. A query graph is defined as
follows:
Definition 2.1.21. Given a query and a set of join predicates and let
R =
{R
1
, R
2
..., R
n
} be a set of n relations specified by the query, we define the query
22
graph G
= (V, E) representing the query as an undirected graph or undirected hy-
pergraph, where
R is represented by the vertex set V and the set of join predicates by
an edge set E
= {(v, w) | v and w represent relations that are referenced by a join
predicate
}.
The result of the algorithms presented here is a join tree. A join tree specifies a
certain execution order of join operations.
Definition 2.1.22. A join tree is a binary tree whose leaf nodes are the relations speci-
fied in a query and whose inner nodes are the join operations. The edges of a join tree
represent the input and output relationship.
An inner node corresponds to a binary join operation which has two subtrees as
input. The leaves of a join tree are the disjoint relations specified in the query. A join
tree or a join subtree may consist of just one node or, alternatively, is the root of two
subtrees.
From a join tree the query evaluation plan can be built. Essentially, the query evalu-
ation plan is an operator tree with physical algebraic operators as nodes. A plan class
groups equivalent plans. The plans within a plan class have the same logical proper-
ties. An import logical property is the set of relations the plan comprises. Another
logical property is the output cardinality of the plans. The optimal plan of a plan class
is the cheapest among all members of the class.
The result of all optimization algorithms presented here will be optimal. From all
possible optimal plans of the same plan class with the same lowest costs, the optimiza-
tion algorithms arbitrarily return one optimal solution.
We give the definition of a connected subgraph and its complement pair, which we
abbreviate with ccp.
Definition 2.1.23. Let G
= (V, E) be a connected query graph, (S
1
, S
2
) is a connect-
ed subgraph and its complement pair (or ccp for short) if the following holds:
· S
1
S
2
= ,
· S
1
with S
1
V induces a connected graph G
|S
1
,
· S
2
with S
2
V induces a connected graph G
|S
2
, and
· (v
1
, v
2
) E with v
1
S
1
v
2
S
2
.
The set of all possible ccps is denoted by P
ccp
. We introduce the notion of ccps for
a set S to specify all those pairs of input sets that result in the same output set S, if
joined.
Definition 2.1.24. Let G
= (V, E) be a connected query graph and S a set with
S
V that induces a connected subgraph G
|S
. For S
1
, S
2
V , (S
1
, S
2
) is called a
ccp for S if
(S
1
, S
2
) is a ccp and S
1
S
2
= S holds.
We can observe:
Observation 2.1.25. If
(S
1
, S
2
) is a ccp for S, then (S
1
, S
2
) is also a partition of S.
23
By P
ccp
(S), we denote the set of all ccps for S. Let G = (V, E) be a connected
query graph and
P
con
(V ) = {S V | G
|S
is connected
|S| > 1} be the set of all
connected subsets of V with more than one element, then P
ccp
=
S
P
con
(V )
P
ccp
(S)
holds.
If
(S
1
, S
2
) is a ccp, then (S
2
, S
1
) is one as well, and we consider them as symmetric
pairs. We are interested in the set P
sym
ccp
of all ccps, where symmetric pairs are ac-
counted for only once, e.g.,
(S
1
, S
2
) P
sym
ccp
if
MIN
index
(S
1
) <
MIN
index
(S
2
) holds,
or
(S
2
, S
1
) P
sym
ccp
otherwise. By
MIN
index
(S) we specify the minimal index i of all
vertices v
i
S (Definition 2.1.2). We give no constraints for choosing which one of
two symmetric pairs should be member of P
sym
ccp
, but leave this as a degree of free-
dom. Analogously, we denote the set of all ccps for a set S containing either
(S
1
, S
2
)
or
(S
2
, S
1
) by P
sym
ccp
(S).
With C
out
we give a simple cost function that sums up the cardinalities of the inter-
mediate results [3]. C
out
is defined as
C
out
(T ) =
0
if T is a single relation
|T | + C
out
(T
1
) + C
out
(T
2
) if T = T
1
T
2
2.1.3. Problem Specification
Since there are many different join ordering problems, we want to specify the problem
domain which we investigate here. We apply the well-accepted connectivity heuristic
[26] by considering the generation of optimal bushy join trees without cross products.
Therefore, the query graph has to be connected. Within this chapter, we allow only
select-project-join queries with simple join predicates that are not referencing more
than two relations. We discard physical properties like sortedness and groupedness and
assume that Bellman's Principle of Optimality holds. For each complexity analysis
done in this document, we specify the complexity of set operations
, , \ to be in
O
(1). When we assume that the word size is unlimited and we represent sets by using
bitvectors, the operations
, , \ can be implemented in a constant (and small) number
of instructions.
2.2. Basic Memoization
As an introduction to top-down join enumeration, we give a basic memoization variant
called TDB
ASIC
, which we derive by utilizing a generic top-down algorithm that in-
vokes a naive partitioning algorithm. In the first sub-subsection, we present our generic
top-down algorithm. Afterwards, we explain the naive partitioning strategy.
2.2.1. Generic Top-Down Join Enumeration
Our generic top-down join enumeration algorithm TDP
LAN
G
EN
is based on memo-
ization. We present its pseudocode in Figure 2.2. Like dynamic programming, TD-
P
LAN
G
EN
initializes the building blocks for atomic relations first (Line 2). Then, in
Line 3 the subroutine TDPGS
UB
is called, which traverses recursively through the
search space. As the name implies, top-down enumeration starts with the vertex set
V
. Hence at the root invocation of TDPGS
UB
, the vertex set S corresponds to the
24
vertex set V of the query graph. At every recursion step of TDPGS
UB
, all possible
join trees of two optimal subjoin trees that together comprise the relations of S are
built through B
UILD
T
REE
(Line 3) that we explain later, and the cheapest join tree is
kept. We enumerate the optimal subjoin trees by iterating over the elements
(S
1
, S
2
)
of P
sym
ccp
(S) in Line 2. This way, we derive the two optimal subjoin trees, each com-
prising exactly the relations in S
1
or S
2
, respectively, by recursive calls to TDPGS
UB
.
Generating P
sym
ccp
(S) is the task of a partitioning algorithm. Depending on the choice
of the partitioning strategy, the overall performance of TDP
LAN
G
EN
can vary by or-
ders of magnitude.
The recursive descent stops when either
|S| = 1 or TDPGS
UB
has already been
called for that G
|S
. In both cases, the optimal join tree is already known. To pre-
vent TDPGS
UB
from computing an optimal tree twice, BestT ree
[S] is checked in
Line 1. BestT ree
[S] yields a reference to an entry in an associative data structure
called memotable. The data structure "memoizes" the optimal join tree generated for a
set S. If BestT ree
[S] equals
NULL
, this invocation of TDPGS
UB
will be the first one
with G
|S
as input, and the optimal join tree of G
|S
has not been found yet. Otherwise,
BestT ree
[S] will hold an optimal plan for the plan class of S. Note that during the
optimization of S the optimal plan for the plan class might not be found yet because
there are still ccps for S to be considered. This means that during the process of loop-
ing over P
sym
ccp
(S) (Line 2) BestT ree[S] does not necessarily hold the optimal join
tree for S, but the cheapest tree found so far.
TDP
LAN
G
EN
(G)
£ Input: connected G=(V,E), V =
1i|V |
{R
i
}
£ Output: an optimal join tree for G
1
for i
1 to n
2
BestT ree
({R
i
}) R
i
3
return TDPGS
UB
(V )
TDPGS
UB
(G
|S
)
£ Input: connected sub graph G
|S
£ Output: an optimal join tree for G
|S
1
if BestT ree
[S] =
NULL
2
for all
(S
1
, S
2
) P
sym
ccp
(S)
3
do
BUILD
T
REE
(S, TDPGS
UB
(G
|S
1
), TDPGS
UB
(G
|S
2
))
4
return BestT ree(S)
Figure 2.2.: Pseudocode for TDP
LAN
G
EN
The pseudocode of
BUILD
T
REE
is given in Figure 2.3. It is used to compare
the cost of the join trees that belong to the same G
|S
. Since the symmetric pairs
(S
1
, S
2
) and (S
2
, S
1
) (Line 2 of TDPGS
UB
) are enumerated only once, we have to
build two join trees (Line 1 and Line 4) and then compare their costs. We use the
method C
REATE
T
REE
, which takes two disjoint join trees as arguments and com-
bines them to a new join tree. If different join implementations have to be con-
sidered, among all alternatives the cheapest join tree has to be built by C
REATE
-
T
REE
. If the created join tree (Line 1) is cheaper than BestT ree
[S], or even no
25
BUILD
T
REE
(S, T ree
1
, T ree
2
)
£ Input: connected vertex set S, two sub join trees
1
CurrentT ree
C
REATE
T
REE
(T ree
1
, T ree
2
)
2
if BestT ree
[S] =
NULL
cost(BestT ree[S]) > cost(CurrentT ree)
3
BestT ree
[S] CurrentT ree
4
CurrentT ree
C
REATE
T
REE
(T ree
2
, T ree
1
)
5
if cost
(BestT ree[S]) > cost(CurrentT ree)
6
BestT ree
[S] CurrentT ree
Figure 2.3.: Pseudocode for
BUILD
T
REE
P
ARTITION
naive
(G
|S
)
£ Input: a connected node induced (sub) graph G
|S
£ Output: P
sym
ccp
(S)
1
for all C
S C =
2
if
MIN
index
(C) <
MIN
index
(S \ C)
IS
C
ONNECTED
(G
|C
)
IS
C
ONNECTED
(G
|S\C
)
3
emit
(C, S \ C)
Figure 2.4.: Pseudocode for naive partitioning
tree for S has been built yet, BestT ree
[S] gets registered with the CurrentT ree.
For building the second tree, we just exchange the arguments (Line 4). Again, the
costs of the new join tree are compared to the costs of BestT ree
[S]. Only if the
new join tree has lower costs, BestT ree
[S] gets registered with the new join tree.
Note that because of Line 3, BestT ree
[S] in Line 6 cannot be
NULL
. Estimating
the costs of the two possible join trees at the same time rather than separately and
comparing them is more efficient, e.g., for cost functions as given in [17], where
card
(T
x
) card(T
y
) cost(T
x
I T
y
) cost(T
y
I T
x
) holds, with card is
the number of tuples or pages and T
x
, T
y
are (intermediate) relations.
2.2.2. Naive Partitioning
As we have already seen, the generic top-down enumeration algorithm iterates over
the elements of P
sym
ccp
(S). Now, we show how the ccps for S can be computed by a
naive generate-and-test strategy. We call our algorithm P
ARTITION
naive
and give its
pseudocode in Figure 2.4. In Line 1, all
2
|S|
-2 possible non-empty and proper subsets
of S are enumerated. For rapid subset enumeration, the method described in [36] can
be used. We demand that from every symmetric pair only one ccp is emitted. There
are many possible solutions, but we make sure that the relation with the lowest index
(Definition 2.1.2) is always contained in C in Line 2. Three conditions have to be met
so that a partition
(C, S \ C) is a ccp. We check the connectivity of G
|C
and G
|S\C
in
Line 2. The third condition that C needs to be connected to S
\ C is ensured implicitly
by the requirement that the graph G
|S
handed over as input is connected.
26
IS
C
ONNECTED
(G
|C
)
£ Input: a node induced subgraph G
|C
£ Output:
TRUE
if G
|C
is connected,
FALSE
otherwise
1
T
{arbitrary element t of C}
2
while
(N (T ) C) =
3
T
T (N (T ) C)
4
return T
= C
Figure 2.5.: Pseudocode for
IS
C
ONNECTED
To check for connectivity of the node-induced subgraphs, Observation 2.1.12 can be
used. The pseudocode follows directly and is given in Fig. 2.5 with
IS
C
ONNECTED
.
2.2.3. Exemplified Execution of TDB
ASIC
As already mentioned, we refer to the generic top-down join enumeration algorithm
TDP
LAN
G
EN
instantiated with P
ARTITION
naive
as TDB
ASIC
. Now we exemplify
the execution of TDB
ASIC
with the query graph of Figure 2.6. In Figure 2.7 we
give the selectivities and cardinalities. For the comparison of equivalent join trees
we use the C
out
cost function (Section 2.1.2). Table 2.1 shows the different states
during execution. Thereby the first column is the table entry that serves as reference.
The second column displays the recursion level, with
0 indicating the root invocation.
The input parameter S is shown in the third column and the current ccp that is being
processed is displayed by the fourth column. In other words, Table 2.1 shows the order
in which the ccps are enumerated. We can observe that the order is not only top-down
but also demand-driven.
Table 2.2 shows a different perspective of the enumeration process by listing the
order of calls to
BUILD
T
REE
. Again, the first column serves as entry of reference. We
give the current ccp for that the call to
BUILD
T
REE
was made in the second column.
With the third column, we reference the corresponding entry of Table 2.1. The fourth
column displays the current join tree. Here, we ignore the join tree that we gain by
applying commutativity, since it will have the same costs. The cost for cT are given
in the fifth column. The sixth column displays the table entry computing the cheapest
tree known so far. This entry corresponds to the current value of BestT ree.
It should not be surprising that Table 2.1 has the same amount of entries as Ta-
ble 2.2. We can observe that the order in which the (sub) join trees are computed
is bottom-up. For this example, the optimal join tree for V
= {R
0
, R
1
, R
2
, R
3
}
is R
0
(R
1
(R
2
R
3
)) with the cost of 21. The most expensive join tree for V is
((R
0
R
1
) R
2
) R
3
with the cost of
11001, which is a huge difference. The most
expensive join tree with the costs of
110000 is R
0
(R
1
R
2
). We can see that because
of memoization, this subtree was not considered when
((R
0
R
1
) R
2
) R
3
was com-
puted, because BestT ree
[{R
0
, R
1
, R
2
}] points to (R
0
R
1
) R
2
, which is only
1
10
of
the costs.
27
R
0
R
1
R
2
R
3
Figure 2.6.: Example graph G
= (V, E) with V = {R
0
, R
1
, R
2
, R
3
} and E =
{(R
0
, R
1
), (R
1
, R
2
), (R
1
, R
3
), (R
2
, R
3
)}
Relation
Cardinality
R
0
1
R
1
10000
R
2
100
R
3
10
Edge
Selectivity
(R
0
, R
1
) 0.1
(R
1
, R
2
) 0.1
(R
1
, R
3
) 0.001
(R
2
, R
3
) 0.01
Figure 2.7.: Selectivity and Cardinalities for the Query Graph of Figure 2.6
For the purpose of completeness, we list all ccps in Table 2.3. In particular, we show
for every connected subgraph G
|S
that is considered by TDB
ASIC
the corresponding
set P
sym
ccp
(S).
2.2.4. Analysis of TDB
ASIC
In this section, we give a brief analysis of TDB
ASIC
. Therefore, we count the number
of iterations
#I of the loop in Line 1 of P
ARTITION
naive
summed up over all invoca-
tions of P
ARTITION
naive
for all S. Here, the number
#I depends on the shape of the
query graph. We analyzed
#I in our experimental studies for different graph shapes
and deduced the following formulas:
For chains, we have
#Ichain
TDBasic(|
V
|) = 2
|V |+2
- |V |
2
- 3 |V | - 4.
For stars, we have
#Istar
TDBasic(|
V
|) = 2 3
|V |-1
- 2
|V |
.
For cycles, we have
#I
cycle
TDBasic(|
V
|) = |V | 2
|V |
+ 2
|V |
- 2|V |
2
- 2.
For cliques, we have
#I
cliques
TDBasic(|
V
|) = 3
|V |
- 2
|V |+1
+ 1.
These formulas are identical to Moerkotte's and Neumann's analysis of the dynamic
programming variant DPS
UB
[22].
For analyzing the algorithm's performance, we have to account for the number of
times the enumerated partitions pass the test for connectivity. This is equal to the num-
ber of times
BUILD
P
LAN
is called, which we abbreviate with
#bP . Since we avoid to
enumerate symmetric partitions twice, this number is half the number of existing ccps.
28
Entry L
S
current ccp
1
0 {R
0
, R
1
, R
2
, R
3
} ({R
0
}, {R
1
, R
2
, R
3
})
2
1
{R
1
, R
2
, R
3
}
({R
1
}, {R
2
, R
3
})
3
2
{R
2
, R
3
}
({R
2
}, {R
3
})
4
1
{R
1
, R
2
, R
3
}
({R
1
, R
2
}, {R
3
})
5
2
{R
1
, R
2
}
({R
1
}, {R
2
})
6
1
{R
1
, R
2
, R
3
}
({R
1
, R
3
}, {R
2
})
7
2
{R
1
, R
3
}
({R
1
}, {R
3
})
8
0 {R
0
, R
1
, R
2
, R
3
} ({R
0
, R
1
}, {R
2
, R
3
})
9
1
{R
0
, R
1
}
({R
0
}, {R
1
})
10
0 {R
0
, R
1
, R
2
, R
3
} ({R
0
, R
1
, R
2
}, {R
3
})
11
1
{R
0
, R
1
, R
2
}
({R
0
}, {R
1
, R
2
})
12
1
{R
0
, R
1
, R
2
}
({R
0
, R
1
}, {R
2
})
13
1
{R
0
, R
1
, R
2
}
({R
0
, R
1
, R
3
}, {R
2
})
14
2
{R
0
, R
1
, R
3
}
({R
0
}, {R
1
, R
3
})
15
2
{R
0
, R
1
, R
3
}
({R
0
, R
1
}, {R
3
})
Table 2.1.: Exemplified execution of TDB
ASIC
for the input graph of Figure 2.6
B
T
current ccp
ref
Entry
cT
cost
(cT ) ref
B
T
1
({R
2
}, {R
3
})
3
R
2
R
3
10
1
2
({R
1
}, {R
2
, R
3
})
2
R
1
(R
2
R
3
)
20
2
3
({R
1
}, {R
2
})
5
R
1
R
2
100000
3
4
({R
1
, R
2
}, {R
3
})
4
(R
1
R
2
) R
3
)
100010
2
5
({R
1
}, {R
3
})
7
R
1
R
3
100
5
6
({R
1
, R
3
}, {R
2
})
6
(R
1
R
3
) R
2
110
2
7 ({R
0
}, {R
1
, R
2
, R
3
})
1
R
0
(R
1
(R
2
R
3
))
21
7
8
({R
0
}, {R
1
})
9
R
0
R
1
1000
8
9 ({R
0
, R
1
}, {R
2
, R
3
})
8
(R
0
R
1
) (R
2
R
3
)
1011
7
10
({R
0
}, {R
1
, R
2
})
11
R
0
(R
1
R
2
)
110000
10
11
({R
0
, R
1
}, {R
2
})
12
(R
0
R
1
) R
2
11000
11
12 ({R
0
, R
1
, R
2
}, {R
3
})
10
((R
0
R
1
) R
2
) R
3
11001
7
13
({R
0
}, {R
1
, R
3
})
14
R
0
(R
1
R
3
)
110
13
14
({R
0
, R
1
}, {R
3
})
15
(R
0
R
1
) R
3
1010
13
15 ({R
0
, R
1
, R
3
}, {R
2
})
13
(R
0
(R
1
R
3
)) R
2
111
7
Table 2.2.: Multiple calls to B
UILD
T
REE
during the enumeration process for the input
graph of Figure 2.6
The numbers again depend on the shape of the query graph but are independent of the
partitioning algorithm.
For chains, stars and cliques, we refer to Ono and Lohman [26] and for cycles to
Moerkotte and Neumann [22]. We give the formula of [22] for cycles divided by two,
since we are only interested in one ccp out of each symmetric pair of ccps.
For chains, we have
#bP chain(|V |) =
|V |
3
- |V |
6
.
29
S
P
sym
ccp
(S)
{R
0
,R
1
,R
2
,R
3
} {({R
0
},{R
1
,R
2
,R
3
}),({R
0
,R
1
},{R
2
,R
3
}),({R
0
,R
1
,R
3
},{R
2
})}
{R
1
,R
2
,R
3
} {({R
1
},{R
2
,R
3
}),({R
1
,R
2
},{R
3
}),({R
1
,R
3
},{R
2
})}
{R
2
,R
3
}
{({R
2
},{R
3
})}
{R
1
,R
2
}
{({R
1
},{R
2
})}
{R
1
,R
3
}
{({R
1
},{R
3
})}
{R
0
,R
1
}
{({R
0
},{R
1
})}
{R
0
,R
1
,R
2
} {({R
0
},{R
1
,R
2
}),({R
0
,R
1
},{R
2
})}
{R
0
,R
1
,R
3
} {({R
0
},{R
1
,R
3
}),({R
0
,R
1
},{R
3
})}
Table 2.3.: All connected (sub) sets S and the corresponding P
sym
ccp
(S) during the exe-
cution of TDB
ASIC
for the input graph of Figure 2.6
Chain
Star
Cycle
Clique
|V |
#I #bP
#I
#bP
#I #bP
#I
#bP
3
10
4
10
4
12
6
12
6
4
32
10
38
12
46
18
50
25
5
84
20
130
32
140
40
180
90
6
198
35
422
80
374
75
602
301
7
438
56
1330
192
924
126
1932
966
8
932
84
4118
448
2174
196
6050
3025
9
1936
120
12610
1024
4956
288
18660
9330
10
3962
165
38342
2304
11062
405
57002
28501
11
8034
220
116050
5120
24332
550
173052
86524
12
16200
286
350198
11264
52958
726
523250
261625
13
32556
364
1054690
24576
114348
936
1577940
788970
14
65294
455
3172262
53248
245366
1183
4750202
2375101
15
130798
560
9533170
114688
523836
1470
14283372
7141686
16
261836
680
28632278
245760
1113598
1800
42915650
21457825
17
523944
816
85962370
524288
2358716
2176
128878020
64439010
18
1048194
1938
258018182
1114112
4980086
2601
386896202
193448101
19
2096730
1140
774316690
2359296
10485036
3078
1161212892
580606446
20
4193840
1330
2323474358
4980736
22019294
3610
3484687250
1742343625
Table 2.4.: Sample values for inner counter and number of calls to
BUILD
P
LAN
For stars, we have
#bP star(|V |) = (|V | - 1) 2
|V |-2
.
For cycles, we have
#bP cycle(|V |) =
|V |
3
- 2 |V |
2
+ |V |
2
.
For cliques, we have
#bP clique(|V |) =
3
|V |
- 2
|V |+1
+ 1
2
.
Table 2.4 compares the number series for the inner counter and the number of calls
to
BUILD
P
LAN
.
30
The number of calls to
BUILD
P
LAN
#bP is the lower bound for any join enumer-
ation algorithm, no matter if it works top-down or bottom-up. As half of the enu-
merated partitions in P
ARTITION
naive
are discarded in Line 1, I and
#bP differ at
least by a factor of two. Considering the counters for clique queries in Table 2.4, we
see that the factor is exactly two because every second partition is discarded through
min
index
(C) < min
index
(S \ C) in Line 2 of TDB
ASIC
S
UB
, but every partition
which qualifies passes the two tests for connectivity since in a clique, every vertex is
connected to every other vertex.
We can conclude that P
ARTITION
naive
has to enumerate all possible subsets C
(Line 1), which is more than the number of existing ccps by several orders of mag-
nitude. The only exception to this observation are clique queries and query graphs
with many cycles. Thus, TDB
ASIC
is too inefficient to be useful, since the worst case
complexity of P
ARTITION
naive
is in O
(2
|V |
) per emitted ccp. Having that in mind,
the question arises whether it is possible to construct a partitioning algorithm that enu-
merates only the ccps for S and avoids to enumerate symmetric pairs twice. With
TDM
IN
C
UT
L
AZY
or TDM
C
L for short, DeHaan and Tompa [5] describe a partition-
ing algorithm that has a worst case complexity of only O
(|V |
2
) [8, 9] per emitted
ccp
. A short description and a subsequent analysis of M
IN
C
UT
L
AZY
can be found in
Appendix A.
The next three sections describe three new partitioning algorithms. In Section 2.3,
we explain how we can improve efficiency of naive partitioning by utilizing a query-
graph-aware approach implemented in M
IN
C
UT
AG
A
T [9]. Section 2.4 describes how
M
IN
C
UT
C
ONSERVATIVE
[12] further improves upon that basic idea. Section 2.5 in-
troduces M
IN
C
UT
B
RANCH
[8] and shows that M
IN
C
UT
B
RANCH
has only a com-
plexity of O
(1) per emitted ccp for arbitrary acyclic query graphs, cycle und clique
queries.
2.3. Advanced Generate and Test
In this section, we describe a novel partitioning algorithm named M
IN
C
UT
AG
A
T.
M
IN
C
UT
AG
A
T was designed by adopting the basic concept of optimistic partitioning
[6] that was shown to be incomplete [9].
2.3.1. TDM
IN
C
UT
AG
A
T
The pseudo-code for the new partitioning algorithm is given in Figure 2.8. We de-
note the instantiated generic top-down join enumeration variant by TDM
IN
C
UT
AG
A
T
or TDM
C
A for short. Like naive partitioning, M
IN
C
UT
AG
A
T relies on a generate-
and-test based approach. At the same time, it is also more sophisticated by utilizing
adjacency information of the query graph G
= (V, E).
The partitioning algorithm is invoked by P
ARTITION
M inCutAGaT
, which in turn im-
mediately calls the recursive component M
IN
C
UT
AG
A
T. The main idea is to succes-
sively enlarge a connected set C with one of its neighbors at every recursive iteration.
Since only adjacent vertices are added to C, we ensure that C remains connected at
any time. Hence, M
IN
C
UT
AG
A
T needs to rely on only one connection test, which is
to ensure the connectedness of C' s complement S
\ C.
31
The initial set for C consisting of an arbitrary vertex t is handed over in Line 2
of P
ARTITION
M inCutAGaT
. Since S and C
S are connected, for every possible
complement S
\ C there must exist a join edge (v
1
, v
2
), where v
1
C and v
2
(S \ C) holds. If S \ C is connected, then the partition (C, S \ C) is a ccp and can
be emitted (Line 1 of M
IN
C
UT
AG
A
T). As a requirement implied by the definition of
P
sym
ccp
(S), (1) duplicate ccps are prohibited and (2) symmetric ccps have to be emitted
only once. The latter constraint is ensured automatically because the start vertex t is
always element of C so that t
(S \ C) holds. For the first constraint, we introduce
an exclusive filter set X that keeps track of vertices v added to C in different branches
of recursion. So once a vertex v is added to X (Line 10) in an ancestor invocation of
M
IN
C
UT
AG
A
T, it cannot be chosen as a neighbor again and is filtered out (Line 8).
If there is only one vertex left in S that is not already a member of C, we can stop
our recursive descent because otherwise, C's complement would be empty in the next
child invocation. We check for this condition in Line 5 and exit if it is true.
Once an articulation vertex a
(S\C) which belongs to more than two biconnected
components G
BCC
i
, G
BCC
j
is added to C, no partition is emitted in the next series of
recursive calls of M
IN
C
UT
AG
A
T. If k is the number of biconnections G
BCC
that
contain a, then no partition is emitted until C is enlarged with k
-1 vertex sets V
i
,
1
i
k, of G
BCC
i
= (V
i
, E
i
) with a V
i
including the vertex sets of those biconnected
components that are directly and indirectly connected with G
BCC
i
= (V
i
, E
i
) through
paths not containing a. In other words, this means the first partition that is emitted
after an articulation vertex a was added to C is produced by a minimal graph cut E
cut
.
Thereby E
cut
contains only edges
(a, x
i
) such that a C x
i
S \ C holds.
For M
IN
C
UT
AG
A
T, we introduce an improved connection test which we call
IS
-
C
ONNECTED
I
MP
(invoked in Line 1). Instead of ensuring that all vertices in a com-
plement S
\ (C {v}) are connected to each other, our novel test ensures only that
the neighbors
N (C {v}) are connected to each other within the complement. In
other words, we check for a weaker condition that tests if from an arbitrary vertex w
with w
N (C {v}) all other vertices u of N (C {v}) \ {w} are reachable within
the complement S
\ (C {v}), i.e. if there exists a path for every u connecting w
with u, but not containing any vertices of C
{v}. We can even further weaken this
condition once we know that the complement S
\ C is already proven to be connected.
In such cases, it is sufficient to check if all elements of
N ({v}) \ C are connected
within S
\ (C {v}). On average,
IS
C
ONNECTED
I
MP
is cheaper to execute than a
common connection test that was given with
IS
C
ONNECTED
in Section 2.2.2. In the
best case, it is in O
(1). The worst case is identical to the complexity of the common
connection test, which is in O
(|S \ (C {v})|). We explain
IS
C
ONNECTED
I
MP
in
detail in Section 2.3.2.
To decide if it is sufficient to check only the neighbors of v, we introduce the set
T
. If the current partition
(C, S \ C) is a ccp (Line 1), then the connection test for the
next child invocations needs to check only for the neighbors of the v that is added to
C
. In that case, the next T is set to T
{v} (Line 3 and Line 9). Otherwise, all
neighbors of C
{v} need to be checked so that the next T is set to T C {v}
(Lines 4 and 9).
32
P
ARTITION
M inCutAGaT
(G
|S
)
£ Input: a connected (sub)graph G
|S
£ Output: P
sym
ccp
(S)
1
t
arbitrary vertex of S
2
M
IN
C
UT
AG
A
T
(G
|S
,
{t}, , {t})
M
IN
C
UT
AG
A
T
(G
|S
, C, X, T
)
£ Input: a connected (sub)graph G
|S
, C
X =
£ Output: ccps for S
1
if
IS
C
ONNECTED
I
MP
(S, C, T )
2
emit
(C, S \ C)
3
T
4
else T
C
5
if
|C| + 1 |S|
6
return
7
X
X
8
for v
(N (C) \ X)
9
M
IN
C
UT
AG
A
T
(G
|S
, C
{v}, X , T {v})
10
X
X {v}
Figure 2.8.: Pseudocode for M
IN
C
UT
AG
A
T
IS
C
ONNECTED
I
MP
(G
|S
, C, T
)
£ Input: (sub)graph G
|S
, C
S, T C
£ Output: if (S \ C) is connected
TRUE
, else
FALSE
1
N
(N (T ) S) \ C
2
if
|N| 1
3
return
TRUE
4
L
5
L
{n} : n N
6
while L
= L N L
7
D
L \ L
8
L
L
9
L
L ((N (D) S) \ C)
10
if N
L
11
return
TRUE
12
else return
FALSE
Figure 2.9.: Pseudocode for
IS
C
ONNECTED
I
MP
2.3.2. An Improved Connection Test
This section explains the novel connection test
IS
C
ONNECTED
I
MP
used by M
IN
-
C
UT
AG
A
T. The pseudo code is given in Figure 2.9.
33
The purpose of
IS
C
ONNECTED
I
MP
is to check if the neighbors of T that are not in C
are connected to each other within S
\C. In Line 1 of
IS
C
ONNECTED
I
MP
, we compute
those neighbors of T as the set N . If N contains only one element, then S
\ C must
be connected and the test returns with a positive result. As a starting point for the test,
we choose an arbitrary vertex n
N (Line 5) and assign it to L . The loop of Line 6
to 9 computes the indirect neighborhood (Definition 2.1.11) of
{n}. This corresponds
to enlarging L with the neighborhood of L (Line 9) or current L (Line 8). Hereby
N
i
({n}) = N (L) within G
|S\C
holds. Furthermore, L holds the previous L (Line 8)
and D holds all the new elements of L , which directly corresponds to a generation of
neighbors. Note that except for the first iteration of the loop,
N (L) (S \ C) = D
holds.
Because of
|D| < |L |, we compute the next generation of neighbors from D instead
of L (Line 9), which is clearly more efficient.
Once L contains all vertices of N , the loop is interrupted (Line 6), and it is obvious
that all n
N are reachable within S \ C so that
TRUE
is returned (Line 11). If, on
the other hand, L cannot be enlarged further so that L
= L holds, we have computed
all indirect neighbors of
{n} (within S \ C) and meet the loop's other stop condition.
In that case, we could not reach every element of N , so that N
\ L = must hold and
we have to return
FALSE
(Line 12). Note that those vertices left in N
\ L must belong
to at least one different connected set that is only adjacent to C and not connected to
L
.
2.3.3. Correctness of Advanced Generate and Test
In this subsection, we prove the correctness of M
IN
C
UT
AG
A
T. But as a prerequisite,
we need to prove the correctness of I
S
C
ONNECTED
I
MP
first.
Correctness of I
S
C
ONNECTED
I
MP
Lemma 2.3.1. Algorithm I
S
C
ONNECTED
I
MP
terminates if G
= (V, E) is a connected
and finite query graph and G
|S
with S
V a connected (sub)graph.
Proof. There are two different exit points: (1) an early exit in Line 3 and (2) the exit
point at the end of I
S
C
ONNECTED
I
MP
(Lines 11 and 12). If
|(N (T ) S) \ C)| 1
holds, the first exit is chosen, and the algorithm terminates. Otherwise, the loop of
Lines 6 to 9 is entered. The loop terminates if L has not been enlarged in the loop's
previous iteration or if all neighbors of T -- that are in S but not in C -- have been
added to L so that N
L holds. Let us assume the worst case: D is reassigned
with only one one vertex set at a time so that
|D| = 1 holds. Let us further assume
that every time
|((N (T ) S) \ C) \ L | = 1 holds as well. Then there can be only
|S \ C| - 1 number of iterations of the loop. We have to subtract the 1, since |L | = 1
holds when the loop is entered. Therefore, the loop terminates at least after
|S \ C| - 1
iterations and the second exit point is reached. Furthermore, since S is a finite set,
every set operation used here, especially the computation of the neighborhood of any
subset D
S, must terminate. Hence, I
S
C
ONNECTED
I
MP
terminates.
Lemma 2.3.2. Let G
= (V, E) be a connected query graph and G
|S
with S
V a
connected (sub)graph. Further, let C be a non-empty subset of S with C
S, then
34
the complement S
\ C is connected if all neighbors of C are connected to each other
through paths containing only vertices of S
\ C.
Proof. By contradiction. We assume that the set S
\ C is not connected, although
all neighbors of C are connected to each other within S
\ C. Since the neighbors of
C
are connected through paths containing only vertices in S
\ C, there has to exist a
maximally enlarged connected set A
=
0i|S\C|
(N
i
(C)\C) as the union of all i-th
gernerations of neighbors of C that are in S
\ C. From this follows that A (S \ C)
and
N (C) A and A C = and N (A) \ C = hold. Because A is connected but
by assumption S
\C is not connected, A (S \C) must hold. Hence, there exists a set
of vertices L
= (S \ C) \ A with L = . Because all neighbors of C are contained in
A
, no vertex in L can be connected to any vertex in C. Furthermore, by the definition
of A and L,
(N (A) \ C) L = holds. But this contradicts the prerequisite that G
|S
is connected.
Lemma 2.3.3. Let G
= (V, E) be a connected query graph and G
|S
with S
V a
connected (sub)graph. Further, let C be a non-empty subset of S with C
S, then
the complement S
\ C is connected if and only if all neighbors of C are connected to
each other through paths containing only vertices of S
\ C.
Proof. Lemma 2.3.2 shows that if all neighbors of C are connected to each other
through paths containing only vertices of S
\ C, then this is a sufficient condition for
G
|S\C
being connected. What is left is to show that it is also a necessary condition.
This is proved by contradiction. Let us assume G
|S\C
is connected but the neighbors
of C are not connected to each other. By Definition 2.1.10, the neighborhood of C
has to be a subset of S
\ C. But if there are at least two neighbors of C for which no
connecting path in S
\ C exists, then this is by Definition 2.1.7 a contradiction to the
connectedness of G
|S\C
.
Lemma 2.3.4. Let G
= (V, E) be a connected query graph and G
|S
with S
V a
connected (sub)graph. Further, let C be a non-empty subset of S with C
S |C|+
2 |S| and let G
|C
and G
|S\C
be connected subgraphs. Moreover, let v be a vertex
so that v
S \ C v N (C) holds. Then the set (S \ C) \ {v} is connected if and
only if all neighbors
N ({v}) of v are connected to each other within (S \ C) \ {v}.
Proof. Let G
= (V , E ) be a connected and finite graph that is identical to the con-
nected subgraph G
|S\C
of G
= (V, E). Then G s vertex set V is defined by V =
S
\C, and G s edge set E is defined by E = {(v
1
, v
2
) | v
1
, v
2
V (v
1
, v
2
) E}.
Let a set S be assigned to S
= V and a set C be assigned to C = {v} with v V .
Following from Lemma 2.3.3, S
\ C = (S \ C) \ {v} is connected if and only if the
neighbors of C
= {v} are connected within S \ C .
Theorem 2.3.1. Algorithm I
S
C
ONNECTED
I
MP
is correct.
Proof. The algorithm is either called with the parameter T assigned to a connected
vertex set C or to a one vertex set
{v}, while v is the vertex recently added to C. From
Lemmas 2.3.3 and 2.3.4, we know that it is sufficient to examine just the connectivity
between the vertices that are in
N (T ) within S \ C, which is done in Lines 5 to 9.
Lemmas 2.3.3 and 2.3.4 cover also the special case of an early exit in Line 3. Finally,
we know that the algorithm must terminate if the query graph is finite through Lemma
2.3.1.
35
Correctness of M
IN
C
UT
AG
A
T
Lemma 2.3.5. Algorithm M
IN
C
UT
AG
A
T terminates if G
= (V, E) is a connected
and finite query graph and G
|S
with S
V a connected (sub)graph.
Proof. The number of possibilities how the set C can be enlarged in every new invo-
cation is limited by the number of neighbors that are not element of X. The maximal
number of neighbors is
|S \ C|. Since |V | is limited, |S| is limited and hence, the
number of choices is limited as well. With each recursive invocation,
|C| grows by 1,
and the complement
|S \ C| decreases by 1. Therefore, the recursion depth of M
IN
-
C
UT
AG
A
T is limited by
|S|, and the maximal number of choices is asymptotic with
an increasing size of C. Because the number of choices is finite, every recursive call
must return. This means that the root invocation also has to return and M
IN
C
UT
AG
A
T
terminates.
Lemma 2.3.6. Let G
= (V, E) be a connected query graph and G
|S
with S
V
a connected (sub)graph, then M
IN
C
UT
AG
A
T enumerates only connected sets and
assigns them to C where C
S holds.
Proof. By induction over the recursion depth n.
Base case: n
= 0
M
IN
C
UT
AG
A
T starts the enumeration with an arbitrary vertex t, which induces a con-
nected subgraph G
|{t}
.
Induction hypothesis: recursion depth n enumerates only connected sets of C
{v}
and passes them as parameters to recursion depth n
+ 1.
Induction step: n
n + 1
M
IN
C
UT
AG
A
T at recursion level n
+ 1 is called with a connected set C (IH) and
considers only vertices that are connected to vertices in C (Definition 2.1.10, Line 8).
As any vertex in
N (C) is directly connected to at least one vertex in C, any vertex v
can be added to C to form a connected set C
{v} (Line 9).
Lemma 2.3.7. Let G
= (V, E) be a connected query graph and G
|S
with S
V a
connected (sub)graph, then M
IN
C
UT
AG
A
T enumerates only ccps for S.
Proof. Because of Lemma 2.3.6, every enumerated set C has to be connected. Since
S
\ C is the complement of C in S, it is disjoint to C, and C (S \ C) = S holds.
Because of Theorem 2.3.1, every set V
\ S that passes the test for connectivity must
be connected. Since only connected query graphs G
= (V, E) or subgraphs G
|S
are
allowed as input, there must exist at least one edge
(v
1
, v
2
) with v
1
C v
2
(S\C).
Hence, according to Definition 2.1.24, every emitted pair
(C, S \ C) must be a ccp for
S
.
Lemma 2.3.8. Let G
= (V, E) be a connected query graph and G
|S
with S
V a
connected (sub)graph. Further, let v
S be a vertex, n a natural number with n 0
with C
n
=
0in
N
i
({v}). Then G
|C
n
must be a connected subgraph of G
|S
.
36
Proof. By induction over n.
Base case: n
= 0
C
0
= N
0
({v}) = {v} (Definition 2.1.11). Thus, G
C
0
is a connected subgraph of G
|S
.
Induction hypothesis: G
C
n
is a connected subgraph of G
|S
for a given, fixed n.
Induction step: n
n + 1
For C
n
+1
holds: C
n
+1
= C
n
N
n
+1
({v}). We know G
C
n
is a connected subgraph of
G
|S
(IH), and because
N
n
({v}) C
n
, all vertices in
N
n
+1
({v}) have to be connected
to at least one vertex in C
n
. It follows that G
|C
n+1
is a connected subgraph of G
|S
.
Lemma 2.3.9. Let G
= (V, E) be a connected query graph and G
|S
with S
V a
connected (sub)graph and v be an arbitrary vertex with v
S. Then the following
holds:
n 0 such that
0in
N
i
({v}) = and
i>n
N
i
({v}) = .
Proof. According to Definition 2.1.11 for a given
N
i
({v}) = , N
i
+1
({v}) =
follows. Besides, since
N
0
({v}) = ,
i
N
i
({v}) V , and
j<i
N
i
({v})N
j
({v}) =
holds, N
|V |
({v}) = for a n [0, |V |[ must hold as well.
Lemma 2.3.10. Let G
= (V, E) be a connected query graph and G
|S
with S
V a
connected (sub)graph. If G
|C
with
|C| > 1 C S is a connected subgraph of G
|S
,
then
v C such that G
|C\{v}
is a connected subgraph of G
|C
.
Proof. In the following, we consider G
C
as connected and as the basis for computing
N ({v}) and N
i
({v}), i.e., we consider only the reduced edge set E
C
with E
C
=
{(v
0
, v
1
) | (v
0
, v
1
) E v
0
, v
1
C}. Let us determine for an arbitrary v
i
C
a natural number n such that
N
n
(v
i
) = N
n
+1
(v
i
) = holds. According to
Lemma 2.3.9, we always can find such a number with n >
0 and
0in
N
i
(v
i
) = C
holds. Following from Lemma 2.3.8,
0i<n
N
i
({v
i
}) induces a connected subgraph
of G
|C
. Furthermore, all vertices in
N
n
({v
i
}) are connected to at least one vertex in
N
n
-1
({v
i
}). This implies that any v
j
N
n
({v
i
}) can be removed. Since n > 0
and
N
n
({v
i
}) = holds, it follows that for any v
j
N
n
({v
i
}) the corresponding
subgraph G
|S\{v
j
}
) has to be a connected subgraph of G
|C
.
Lemma 2.3.11. Algorithm M
IN
C
UT
AG
A
T enumerates only one set C that consists of
a single vertex. This instance of C must be connected.
Proof. The root invocation of M
IN
C
UT
AG
A
T is called with C
= {t}, where t S
holds. This is the only time that C contains just one element. Any recursive self
invocation adds another v
S to the C previously handed over. Since any one-
element vertex set is connected, the only instance of C that contains only one vertex
must be connected as well.
Lemma 2.3.12. In every invocation of M
IN
C
UT
AG
A
T, C
X = holds.
Proof. By induction over the recursion depth n.
37
Base Case: n
= 0
In the root invocation, C
= {t} with t S and X = holds. Thus, C T = holds.
Induction hypothesis: C and X are disjoint for a given, fixed recursion depth n.
Induction step: n
n + 1
We only have to show that in every iteration of Line 9 the sets C
{v} and X are
disjoint. For every value of v (Line 8) v
C and v X holds. Since C X =
(IH) and X
(X N (C)) holds, we can conclude that for all possible values of X
during the loop in Lines 8 to 10, X
C = must hold. This is because the new v is
added to X in Line 10, i.e. C
{v} and X must still be disjoint in Line 9.
Lemma 2.3.13. Algorithm M
IN
C
UT
AG
A
T enumerates all sets C contained in the
powerset of S that are connected and that contain the start vertex t.
Proof. By contradiction. We assume that not all connected sets containing t are enu-
merated and assigned to C. Thus,
C S S = such that G
|C
is a connected
subgraph, and C is not enumerated. If several such C exist, we choose C such that
|C| is minimal. Lemma 2.3.11 implies that |C| > 1. Lemma 2.3.10 implies that
v C : G
C
\{v}
is a connected subgraph. As C is chosen to be minimal, the set
C
= C \ {v} is enumerated.
Case 1: v appears in
N (C ) \ X during the enumeration of C \ {v }. This is a
contradiction to the assumption that C was not enumerated (Line 9).
Case 2: v does not appear in
N (C ) \ X during the enumeration of C . Since v
is connected to C , it must have been excluded, i.e. v
X. We know that in one of
the parent or ancestor invocations of M
IN
C
UT
AG
A
T with a C
C and X X
with v
N (C ), v has been added to X . But right before v has been added to X ,
M
IN
C
UT
AG
A
T must have been invoked with C
{v}. That means that in one of
those child invocations C has already been enumerated because according to Lemma
2.3.12 C
(X \ {v}) = must hold.
Lemma 2.3.14. Let G
= (V, E) be a connected query graph and G
|S
with S
V a
connected (sub)graph. Algorithm M
IN
C
UT
AG
A
T enumerates all ccps
(C, S \ C) for
S where in all sets C the start vertex t is contained.
Proof. According to Lemma 2.3.7, M
IN
C
UT
AG
A
T enumerates only ccps for C. Be-
cause of Lemma 2.3.13, all possible connected sets C that contain the start vertex t
are enumerated. Thus, Algorithm M
IN
C
UT
AG
A
T must enumerate all possible ccps
(C, S \ C) for S where the condition t C holds.
Lemma 2.3.15. Let G
= (V, E) be a connected query graph and G
|S
with S
V
a connected (sub)graph. Algorithm M
IN
C
UT
AG
A
T enumerates all connected sets C
that belong to the power set of S containing the start vertex t
S only once.
Proof. By contradiction. We assume that
C S, which is enumerated at least twice.
If multiple such C exist, we choose C such that
|C| is minimal.
Case 1:
|C| = 1. As discussed in Lemma 2.3.11, M
IN
C
UT
AG
A
T enumerates only
one set consisting of a single vertex in the root invocation. Thus,
|C| = 1 is not
produced twice, as it is not enumerated in any recursive self invocation.
38
Case 2:
|C| > 1. C cannot be enumerated by two different invocations of M
IN
C
UT
A-
G
A
T with the same parameters, as
|C| is minimal (otherwise, a smaller |C| must exist
that also has to be enumerated twice). Thus, there must be two different enumeration
paths producing C. As both paths start with the same vertex t, they are identical at
the beginning (Lemma 2.3.11) in enlarging a set C where C
C holds. At some
point, the common path must fork and two vertices v
1
, v
2
N (C ) with v
1
= v
2
and v
1
, v
2
C must exist. Thus, there are two child invocations, the first one with
C
{v
1
} for the parameter C, and the second one with C {v
2
} for the parameter C.
They cannot both have identical sets for the parameter X, because for the later child
invocation,
{v
1
} is added. But since v
1
C , the vertex must be added later during the
recursive descent, but this is impossible, since v
1
is already included in the set X.
Lemma 2.3.16. Let G
= (V, E) be a connected query graph and G
|S
with S
V a
connected (sub)graph. Algorithm M
IN
C
UT
AG
A
T enumerates all ccps
(C, S \ C) for
S where all sets C contain the start vertex t only once.
Proof. As Lemma 2.3.15 clarifies, all connected sets C containing the start vertex t are
enumerated only once. Since the set S
\C is the complement of C in S and there exists
just one complement per set C in S, M
IN
C
UT
AG
A
T enumerates all ccps
(C, S \ C)
for S only once and t
C always holds.
Lemma 2.3.17. Let G
= (V, E) be a connected query graph and G
|S
with S
V a
connected (sub)graph. Let a partition
(C, S \ C) be a ccp with an arbitrary C such
that C
S holds. Then, M
IN
C
UT
AG
A
T emits either
(C, S \ C) or its symmetric
counter pair
(S \ C, C), but never both ccps.
Proof. Algorithm M
IN
C
UT
AG
A
T emits the ccps always in the form of
(C, S \ C) but
never in the form of
(S \ C, C) (Line 2 of M
IN
C
UT
AG
A
T). Since the start vertex t is
always part of C and never element of the complement S
\ C from two possible sym-
metric pairs, only one ccp for S is emitted. Because Lemma 2.3.14 and Lemma 2.3.16
hold, M
IN
C
UT
AG
A
T enumerates all symmetric ccps for S only once.
Theorem 2.3.2. Algorithm M
IN
C
UT
AG
A
T is correct.
Proof. The theorem follows immediately from Lemma 2.3.5, Lemma 2.3.7,
Lemma 2.3.14, Lemma 2.3.15, and Lemma 2.3.17.
2.4. Conservative Graph Partitioning
This section presents a partitioning algorithm named conservative partitioning, which
we denote by M
IN
C
UT
C
ONSERVATIVE
[12]. It is an improvement of the advanced
generate-and-test approach presented in Section 2.3. The algorithm emits all ccps for
a connected vertex set S for which S
V holds and where V is the vertex set of the
query graph G
= (V, E). We denote the instantiated top-down memoization variant
(Section 2.2.1) by TDM
IN
C
UT
C
ONSERVATIVE
or TDM
C
C for short. Its pseudo-code
is given in Figure 2.10.
Conservative partitioning is invoked by P
ARTITION
M inCutConservative
, which in
turn immediately calls its recursive component M
IN
C
UT
C
ONSERVATIVE
. The basic
39
idea of conservative partitioning is similar to the advanced-generate-and-test approach:
to successively enhance a connected set C by members of its neighborhood
N (C) at
every recursive iteration. For query graphs with many biconnected components, M
IN
-
C
UT
AG
A
T does not perform optimally. The reason is that enhancing C in a way that
every graph cut is minimal cannot always be done by just removing one vertex v from
S
\ C. As mentioned in Section 2.3, this can happen when v is an articulation vertex
a
. In the majority of these cases, M
IN
C
UT
AG
A
T is not able to emit a ccp in every
iteration, which increases the amortized costs per minimal cut. M
IN
C
UT
C
ONSERVA
-
TIVE
is designed to overcome this disadvantage with a conservative approach. In fact,
conservative partitioning emits a ccp at the start of every invocation except for the root
invocation.
Like M
IN
C
UT
AG
A
T, the process of enhancing C starts with a single vertex t
S.
For M
IN
C
UT
C
ONSERVATIVE
, this is done through a redefinition of
N () = {t}.
M
IN
C
UT
C
ONSERVATIVE
ensures that while C is enlarged it remains connected at any
time. Since S and C
S are connected, for every possible complement S \ C there
must exist a join edge
(v
1
, v
2
), where v
1
C and v
2
(S \ C) holds. If at some
point of enlarging C its complement S
\C in S is connected as well, the algorithm has
found a ccp for S.
2.4.1. Correctness Constraints
Besides, the connectivity of C's complement conservative partitioning has to meet
some more constraints before emitting a ccp: (1) Symmetric ccps are emitted once, (2)
the emission of duplicates has to be avoided, and (3) all ccps for S have to be computed
as long as they comply with constraint (1).
Constraint (1) is ensured because the start vertex t is always contained in C and,
therefore, can never be part of its complement S
\ C. For the second constraint, the
algorithm uses a filter set X of neighbors to exclude from processing. And for con-
straint (3), it is sufficient to ensure that all possible connected subsets of S that have a
connected complement S
\ C are considered when enlarging C.
2.4.2. The Algorithm in Detail
There are certain scenarios, e.g., when star queries are considered, where constructing
every possible connected subset C of S produces an exponential overhead because
most of the complements S
\ C are not connected and the partitions (C, S \ C) com-
puted this way are not valid ccps. Therefore, the algorithm follows a conservative
approach by enhancing C in such a way that the complement must be connected as
well. Before we explain this technique, we have to make some observations. From
the recursive process of enlarging C, we know that the number of members in C must
increase by at least one in every iteration. Furthermore, if a partition
(C, S \ C) is not
a ccp for S, then S
\ C consists of k 2 connected subsets O
1
, O
2
, ..., O
k
(S \ C)
that are disjoint and not connected to each other. Hence, those subsets O
1
, O
2
, ...O
k
can only be adjacent to C. Let v
1
, v
2
, ..., v
l
be all the members of C's neighborhood
N (C). Then every O
i
where
1 i k must contain at least one such v
j
where
1 j l and k l holds. The first ccp after recursively enlarging C by members of
S
\ C would be generated when all subsets O
i
with
1 i k but one are joined to C.
40
P
ARTITION
M inCutConservative
(G
|S
)
£ Input: a connected (sub)graph G
|S
, arbitrary vertex t
S
£ Output: P
sym
ccp
(S)
1
M
IN
C
UT
C
ONSERVATIVE
(S, , )
M
IN
C
UT
C
ONSERVATIVE
(G
|S
, C, X
)
£ Input: a connected (sub)graph G
|S
, C
X =
£ Output: ccps for S
1
if C
= S C X =
2
return
3
if C
=
4
emit
(C, S \ C)
5
X
X
6
for v
((N (C) S) \ X)
7
O
=
GET
C
ONNECTED
C
OMPONENTS
(G
|S
, C
{v}, {v})
8
for all O
i
O
9
M
IN
C
UT
C
ONSERVATIVE
(G
|S
, S
\ O
i
, X
)
10
X
X {v}
£ N () = {arbitrary element of t S}
Figure 2.10.: Pseudocode for M
IN
C
UT
C
ONSERVATIVE
Hence, in order to ensure that at every recursive iteration of M
IN
C
UT
C
ON
-
SERVATIVE
the complement S
\ C is connected as well, it does not always suffice to
enlarge C by only one of its neighbors but by a larger subset
i
=h
O
i
with
1 h k
of its direct and indirect neighborhood. For the computation of all subsets O
i
with
1 i k, M
IN
C
UT
C
ONSERVATIVE
invokes
GET
C
ONNECTED
C
OMPONENTS
in
Line 7, which calculates an output set O containing all subsets O
i
. If the complement
S
\ C is connected, the returned O contains only one O
i
with O
i
= O
1
= S \ C.
Section 2.4.3 explains
GET
C
ONNECTED
C
OMPONENTS
in detail.
Once O
= {O
1
, O
2
, ..., O
k
} is returned, M
IN
C
UT
C
ONSERVATIVE
invokes itself for
k
different times with a corresponding new C
= S \ O
i
, while
1 i k holds. Note
that if the old S
\ C is connected, there is only one branch of recursions with a new
C
= S \ O
1
= C {v}. The ccp corresponding to S \ O
i
is emitted in Line 4 during
the corresponding child invocation. The recursive descent stops once the last neighbor
v
N (C) was added to C so that for the successive child invocation the condition of
Line 1 holds.
As mentioned for meeting constraint (2), conservative partitioning makes use of the
filter set X. Therefore, the current v is added to X (Line 10) after M
IN
C
UT
C
ONSER
-
VATIVE
has returned from the k recursive calls of Line 9. This ensures that in all other
recursive descents invoked with the remaining v
N (C) yet to be processed, the cur-
rent v cannot be chosen as a neighbor again and is excluded from further consideration
(Line 6).
41
2.4.3. Exploring Connected Components
This section explains how
GET
C
ONNECTED
C
OMPONENTS
works. For performance
reasons, the computation of the connected components is implemented as a twofold
strategy. The first part of
GET
C
ONNECTED
C
OMPONENTS
adopts the improved con-
nection test
IS
C
ONNECTED
I
MP
(Section 2.3.2), and the second part is only executed
if the connection test fails. The pseudo code is given in Figure 2.11.
When comparing the first part of
GET
C
ONNECTED
C
OMPONENTS
(Lines 1 to 12)
to
IS
C
ONNECTED
I
MP
, the only difference can be found in Lines 3, 11 and 12. This is
because
IS
C
ONNECTED
I
MP
returns a Boolean result and
GET
C
ONNECTED
C
OMPO
-
NENTS
a set O of sets O
i
. In case that S
\C is connected, we have to return C s whole
complement as the only O
i
either in Line 3 or in Line 11. For a detailed explanation
of Lines 1 to 12, we refer to Section 2.3.2.
In case S
\ C is not connected, we are not able to reach all neighbors of T that
are elements of S
\ C. Therefore, the condition in Line 10 will not evaluate to
TRUE
.
Hence, L equals our first O
i
(Line 12). For the computation of the remaining O
j
, we
have to execute the second part of
GET
C
ONNECTED
C
OMPONENTS
.
Therefore, we introduce a set U and assign it with all neighbours of T that are not
in C and are not element of L (Line 13). We use U to indicate whether there is any
other O
j
left to compute. This way, U serves as an abort criteria for the loop of Lines
14 to 22. The inner loop of Lines 17 to 20 resembles the loop of our first part (Lines 6
to 9). But this time we cannot implement the second stop condition, as we have done
in Line 6 by checking N
L . The second condition was added as an optimization
to discover as early as possible that all neighbors of T
= {v} are connected to each
other. This cannot be done here, since the sole purpose of this inner loop is to compute
the remaining O
j
.
2.5. Branch Partitioning
With the advanced-generate-and-test approach, we improved the naive portioning al-
gorithm by successively enlarging a connected set C. But for graphs with many bi-
connected components, this technique alone was not efficient enough, since in several
invocations of M
IN
C
UT
AG
A
T no ccps could be emitted. This is because C enlarged
only by one vertex at a time does not always result in a connected complement. As a
result, we proposed the conservative partitioning approach. Therefore, we transformed
the connection test into a computation of the connected and maximally enlarged sets
of C's complement. This section presents branch partitioning, a further improved par-
titioning strategy where the effort of discovering the connected sets O
i
within C's
complement are avoided.
2.5.1. Branch Partitioning - An Overview
We denote the branch partitioning algorithm by M
IN
C
UT
B
RANCH
. The algorithm is
invoked by TDPGS
UB
to compute for a given vertex set S of a connected (sub)graph
G
|S
all possible partitions into two disjoint interconnected sets
(S
1
, S
2
) that are ccps
for S. The output of branch partitioning is the set P
sym
ccp
(S) so that symmetric ccps are
emitted only once. In the Figures 2.12 and 2.13, we give the algorithm's pseudocode
42
GET
C
ONNECTED
C
OMPONENTS
(G
|S
, C, T
)
£ Input: (sub)graph G
|S
, C
S, T C
£ Output: set O of connected disjoint sets O
i
O
i
,O
j
O,i=j
O
i
(S \ C) O
i
O
j
=
1
N
(N (T ) S) \ C
2
if
|N| 1
3
return
{S \ C}
4
L
5
L
{n} : n N
6
while L
= L N L
7
D
(L \ L)
8
L
L
9
L
L N (D) \ C
10
if N
L
11
return
{S \ C}
12
else O
{L}
£ now explore all other to C adjacent subsets
13
U
N \ L
14
while U
=
15
L
16
L
{n} : n U
17
while L
= L
18
D
(L \ L)
19
L
L
20
L
L N (D) \ C
21
U
U \ L
22
O
{L}
23
return O
Figure 2.11.: Pseudocode for
GET
C
ONNECTED
C
OMPONENTS
with P
ARTITION
M inCutBranch
and M
IN
C
UT
B
RANCH
. We call the instantiated gener-
ic memoization variant TDM
IN
C
UT
B
RANCH
, with a TD as a prefix to indicate the
top-down join enumeration algorithm that is based on branch partitioning.
The algorithm's approach adopts the idea of M
IN
C
UT
AG
A
T: to recursively enlarge
a set C by members of its neighborhood
N (C), starting with a single vertex t S.
This way, we ensure that at every instance of the algorithm's execution C is connected.
If at some point of enlarging C its complement S
\ C in S is connected as well, the
algorithm has found a ccp for S. Besides, the connectivity of C's complement branch
partitioning has to meet some more constraints before emitting a ccp: (1) Symmetric
ccp
s are emitted once, (2) the emission of duplicates has to be avoided, and (3) all ccps
for S have to be computed as long as they comply with constraint (1).
Constraint (1) is ensured because the start vertex t - arbitrarily chosen during the
initialization of the partitioning algorithm in Line 1 of P
ARTITION
M inCutBranch
- is
always contained in C and, therefore, can never be part of its complement. For the
second constraint, the algorithm uses a filter set X of neighbors to exclude from pro-
43
cessing. After every recursive self-invocation of the algorithm, the neighbor v
N (C)
that was used to enlarge C is added to X. Later, we will see in detail how this works.
For constraint (3), it is sufficient to ensure that all possible connected subsets of S are
considered when enlarging C.
Checking for the connectivity of the complement set adds a linear overhead per test.
Furthermore, there are certain scenarios, e.g., when star queries are considered, where
constructing every possible connected subset C of S produces an exponential overhead
because most of the complements S
\C are not connected and the partitions (C, S \C)
computed this way are not valid ccps. For branch partitioning, we propose a novel
technique which ensures that no partitions are generated that are not a ccp at the same
time. As a positive side effect, the additional check for connectivity or the discovery
of connected components as needed by M
IN
C
UT
AG
A
T and M
IN
C
UT
C
ONSERVATIVE
can be eliminated.
Before we explain our technique, we repeat important observations of Section 2.4.2.
From the recursive process of enlarging C, we know that the number of members in C
must increase by one in every iteration. Furthermore, if a partition
(C, S \ C) is not a
ccp
for S, then S
\C consists of k 2 connected subsets O
1
, O
2
, ..., O
k
(S\C) that
are disjoint and not connected to each other. Hence, those subsets O
1
, O
2
, ...O
k
can
only be adjacent to C. Let v
1
, v
2
, ..., v
l
be all the members of C's neighborhood
N (C).
Then every O
i
with
1 i k must contain at least one such v
j
with
1 j l and
k
l holds. The first ccp after enlarging C by members of S \ C would be generated
when all subsets O
i
with
1 i k but one are joined to C.
Having made these observations, we are ready to explain our basic idea. The key
principle is to exploit information about how S
\ C is connected from all of M
IN
-
C
UT
L
AZY
's child invocations. Therefore, we introduce a new input parameter L and
a result set R. The one-element set L contains the last vertex v that was added to
C
through the parent invocation. The result set R of a child invocation contains the
maximally enlarged and connected set O
i
such that L
R holds. We compute R by
combining the result sets R
tmp
from the child invocations with L. But we have to be
careful to include only those R
tmp
that are adjacent to L. Hence, we need to distin-
guish between
N (L) and (N (C) \ N (L)): only those R
tmp
can be joined to R where
N (L) R
tmp
= holds.
To make use of the connected sets R
tmp
that are adjacent to v, we postpone the
emission of ccps towards the end. Instead of enlarging C with all but one R
tmp
when
the complement S
\ C is not connected, we introduce an optimization which simply
emits
(S \ R
tmp
, R
tmp
) right away. Note that if S \ C is connected, there exists only
one R
tmp
with R
tmp
= S \ C and (S \ R
tmp
, R
tmp
) = (C, S \ C) holds. In case
S
\C is not connected, C (S \R
tmp
) must hold. We have said that due to constraint
(3), all connected subsets of S have to be considered as values for the set C. Through
the optimization certain connected sets S
\ R
tmp
are skipped. Because we avoid only
those S
\ R
tmp
where the complement S
\ (S \ R
tmp
) = R
tmp
is not a connected set,
our optimization is still sufficient to meet constraint (3).
2.5.2. The Algorithm in Detail
In the following, we take a closer look at the pseudocode given in Figures 2.12 and
2.13. P
ARTITION
M inCutBranch
calls M
IN
C
UT
B
RANCH
the first time with C
= L =
44
Details
- Pages
- Type of Edition
- Erstausgabe
- Publication Year
- 2014
- ISBN (eBook)
- 9783954898367
- ISBN (Softcover)
- 9783954893362
- File size
- 2.4 MB
- Language
- English
- Publication date
- 2014 (December)
- Keywords
- efficient memoization algorithms query optimization top-down join enumeration basis hypergraphs
- Product Safety
- Anchor Academic Publishing