|
Contents |
7 |
|
|
Foreword |
13 |
|
|
Preface |
15 |
|
|
Acknowledgments |
17 |
|
|
PART I METRIC SEARCHING IN A NUTSHELL |
18 |
|
|
Chapter 1 FOUNDATIONS OF METRIC SPACE SEARCHING |
22 |
|
|
1. The Distance Searching Problem |
23 |
|
|
2. The Metric Space |
25 |
|
|
3. Distance Measures |
26 |
|
|
3.1 Minkowski Distances |
27 |
|
|
3.2 Quadratic Form Distance |
28 |
|
|
3.3 Edit Distance |
29 |
|
|
3.4 Tree Edit Distance |
30 |
|
|
3.5 Jaccard's Coefficient |
30 |
|
|
3.6 Hausdorff Distance |
31 |
|
|
3.7 Time Complexity |
31 |
|
|
4. Similarity Queries |
32 |
|
|
4.1 Range Query |
32 |
|
|
4.2 Nearest Neighbor Query |
33 |
|
|
4.3 Reverse Nearest Neighbor Query |
34 |
|
|
4.4 Similarity Join |
34 |
|
|
4.5 Combinations of Queries |
35 |
|
|
4.6 Complex Similarity Queries |
35 |
|
|
5. Basic Partitioning Principles |
37 |
|
|
5.1 Ball Partitioning |
37 |
|
|
5.2 Generalized Hyperplane Partitioning |
38 |
|
|
5.3 Excluded Middle Partitioning |
38 |
|
|
6. Principles of Similarity Query Execution |
39 |
|
|
6.1 Basic Strategies |
39 |
|
|
6.2 Incremental Similarity Search |
42 |
|
|
7. Policies for Avoiding Distance Computations |
43 |
|
|
7.1 Explanatory Example |
44 |
|
|
7.2 Object-Pivot Distance Constraint |
45 |
|
|
7.3 Range-Pivot Distance Constraint |
47 |
|
|
7.4 Pivot-Pivot Distance Constraint |
48 |
|
|
7.5 Double-Pivot Distance Constraint |
50 |
|
|
7.6 Pivot Filtering |
51 |
|
|
8. Metric Space Transformations |
52 |
|
|
8.1 Metric Hierarchies |
53 |
|
|
8.1.1 Lower-Bounding Functions |
53 |
|
|
8.2 User-Defined Metric Functions |
55 |
|
|
8.2.1 Searching Using Lower-Bounding Functions |
55 |
|
|
8.3 Embedding Metric Space |
56 |
|
|
8.3.1 Embedding Examples |
56 |
|
|
8.3.2 Reducing Dimensionality |
57 |
|
|
9. Approximate Similarity Search |
58 |
|
|
9.1 Principles |
58 |
|
|
9.2 Generic Algorithms |
61 |
|
|
9.3 Measures of Performance |
63 |
|
|
9.3.1 Improvement in Efficiency |
63 |
|
|
9.3.2 Precision and Recall |
63 |
|
|
9.3.3 Relative Error on Distances |
65 |
|
|
9.3.4 Position Error |
66 |
|
|
10. Advanced Issues |
67 |
|
|
10.1 Statistics on Metric Datasets |
68 |
|
|
10.1.1 Distribution and Density Functions |
68 |
|
|
10.1.2 Distance Distribution and Density |
69 |
|
|
10.1.3 Homogeneity of Viewpoints |
71 |
|
|
10.2 Proximity of Ball Regions |
72 |
|
|
10.3 Performance Prediction |
75 |
|
|
10.4 Tree Quality IMeasures |
77 |
|
|
10.5 Choosing Reference Points |
80 |
|
|
Chapter 2 SURVEY OF EXISTING APPROACHES |
84 |
|
|
1. Ball Partitioning IMethods |
84 |
|
|
1.1 Burkhard- Keller Tree |
85 |
|
|
1.2 Fixed Queries Tree |
86 |
|
|
1.3 Fixed Queries Array |
87 |
|
|
1.4 Vantage Point Tree |
89 |
|
|
1.5 Excluded Middle Vantage Point Forest |
92 |
|
|
2. Generalized Hyperplane Partitioning Approaches |
93 |
|
|
2.1 Bisector Tree |
93 |
|
|
2.2 Generalized Hyperplane Tree |
94 |
|
|
3. Exploiting Pre- Computed Distances |
95 |
|
|
3.1 AESA |
95 |
|
|
3.2 Linear AESA |
96 |
|
|
3.3 Other IMethods |
97 |
|
|
4. Hybrid Indexing Approaclies |
98 |
|
|
4.1 Multi Vantage Point Tree |
98 |
|
|
4.2 Geometric Near-neighbor Access Tree |
99 |
|
|
4.3 Spatial Approximation Tree |
102 |
|
|
4.4 M-tree |
104 |
|
|
4.5 Similarity Hashing |
105 |
|
|
5. Approximate Similarity Search |
106 |
|
|
5.1 Exploiting Space Transformations |
106 |
|
|
5.2 Approximate Nearest Neighbors with BBD Trees |
107 |
|
|
5.3 Angle Property Technique |
109 |
|
|
5.4 Clustering for Indexing |
111 |
|
|
5.5 Vector Quantization Index |
112 |
|
|
5.6 Buoy Indexing |
114 |
|
|
5.7 Hierarchical Decomposition of IMetric Spaces |
114 |
|
|
5.7.1 Relative Error Approximation |
115 |
|
|
5.7.2 Good Fraction Approximation |
115 |
|
|
5.7.3 Small Chance Improvement Approximation |
115 |
|
|
5.7.4 Proximity-Based Approximation |
116 |
|
|
5.7.5 PAC Nearest Neighbor Search |
116 |
|
|
PART II METRIC SEARCHING IN LARGE COLLECTIONS OF DATA |
118 |
|
|
Chapter 3 CENTRALIZED INDEX STRUCTURES |
122 |
|
|
1. M-tree Family |
122 |
|
|
1.1 The M-tree |
122 |
|
|
1.2 Bulk-Loading Algorithm of IVl-tree |
126 |
|
|
1.3 Multi-Way Insertion Algorithm |
129 |
|
|
1.4 The Slim Tree |
130 |
|
|
1.4.1 Slim-Down Algorithm |
131 |
|
|
1.4.2 Generalized Slim-Down Algorithm |
133 |
|
|
1.5 Pivoting M-tree |
135 |
|
|
1.6 The M+-tree |
138 |
|
|
1.7 The M2 -tree |
141 |
|
|
2. Hash-based metric indexing |
142 |
|
|
2.1 The D-index |
143 |
|
|
2.1.1 Insertion and Search Strategies |
146 |
|
|
2.2 The eD-index |
148 |
|
|
2.2.1 Similarity Self-Join Algorithm with eD-index |
150 |
|
|
3. Performance Trials |
153 |
|
|
3.1 Datasets and Distance Measures |
154 |
|
|
3.2 Performance Comparison |
155 |
|
|
3.3 Different Query Types |
157 |
|
|
3.4 Scalability |
158 |
|
|
Chapter 4 APPROXIMATE SIMILARITY SEARCH |
162 |
|
|
1. Relative Error Approximation |
162 |
|
|
2. Good Fraction Approximation |
165 |
|
|
3. Small Chance Improvement Approximation |
167 |
|
|
4. Proximity-Based Approximation |
169 |
|
|
5. PAC Nearest Neighbor Searching |
170 |
|
|
6. Performance Trials |
171 |
|
|
6.1 Range Queries |
172 |
|
|
6.2 Nearest Neighbors Queries |
173 |
|
|
6.3 Global Considerations |
176 |
|
|
Chapter 5 PARALLEL AND DISTRIBUTED INDEXES |
178 |
|
|
1. Preliminaries |
178 |
|
|
1.1 Parallel Computing |
179 |
|
|
1.2 Distributed Computing |
180 |
|
|
1.2.1 Scalable and Distributed Data Structures |
180 |
|
|
1.2.2 Peer-to-Peer Data Networks |
181 |
|
|
2. Processing M-trees with Parallel Resources |
181 |
|
|
2.1 CPU Parallelism |
182 |
|
|
2.2 I/O Parallelism |
182 |
|
|
2.3 Object Declustering in M-trees |
184 |
|
|
3. Scalable Distributed Similarity Search Structure |
184 |
|
|
3.1 Architecture |
185 |
|
|
3.2 Address Search Tree |
186 |
|
|
3.3 Storage Management |
186 |
|
|
3.3.1 Bucket Splitting |
187 |
|
|
3.3.2 Choosing Pivots |
188 |
|
|
3.4 Insertion of Objects |
188 |
|
|
3.5 Range Search |
189 |
|
|
3.6 Nearest Neighbor Search |
190 |
|
|
3.7 Deletions and Updates of Objects |
191 |
|
|
3.8 Image Adjustment |
192 |
|
|
3.9 Logarithmic Replication Strategy |
194 |
|
|
3.10 Joining the Peer-to-Peer Network |
195 |
|
|
3.11 Leaving the Peer-to-Peer Network |
195 |
|
|
4. Performance Trials |
196 |
|
|
4.1 Datasets and Computing Infrastructure |
197 |
|
|
4.2 Performance of Similarity Queries |
197 |
|
|
4.2.1 Global Costs |
198 |
|
|
4.2.3 Comparison of Search Algorithms |
205 |
|
|
4.3 Data Volume Scalability |
206 |
|
|
Concluding Summary |
210 |
|
|
References |
214 |
|
|
Author Index |
228 |
|
|
Index |
232 |
|
|
Abbreviations |
236 |
|