Tuesday, April 26, 2011

The Black Art Of Spatial Index Tuning In SQL Server

For my inaugural blog post I have decided to discuss some investigations I have been making recently into tuning spatial indexes in SQL Server.

I have found that while there is a lot of information on teh interwebs about the theory of SQL Server spatial indexes, and the utilities that can be used to tune them, there is very little guidance on what spatial index settings are best suited for different data scenarios, and there is only limited information about the characteristics to consider when tuning these indexes.

As an attempt to provide some guidance I have investigated configuring and tuning spatial indexes on three different spatial data sets which I have discussed as case studies, but before jumping straight into these I think it would be wise to discuss some background aspects to spatial indexing in SQL Server.



SQL Server Spatial Index Theory


I'm not going to get too much into the theory of spatial indexes in SQL server, or the details of creating spatial indexes because there is a lot of documentation on this subject, but I will be referring to aspects of spatial index theory throughout this post.  The best starting point for delving into this area is the SQL Server Spatial Indexing Overview documentation.  Another great source of information is Isaac Kunen's Blog.

A SQL Server spatial index is basically a grid of cells that cover the extent of the spatial data in a table which allows spatial queries to easily filter out prospective records and lookup smaller subsets of records to see if they match the spatial criteria. e.g. If a spatial query is trying to find all the records that intersect a particular rectangle, then a spatial index can improve the performance by only executing the intersection criteria on records that are contained in a grid cell(s) that the criteria rectangle resides in.

Rather than just being a single layer of cells over the extent of the spatial data, the index is actually a layered grid which is four levels deep (see the image from the SQL Server documentation below).  Each cell in level 1 of the grid is divided into cells that make up level 2, and so on for level 3 and 4.



Each level of the grid can be divided into a different number of cells based on a setting of Low (4x4 or 16 cells), Medium (8x8 or 64 cells), or High (16x16 or 256 cells).  This is what controls the granularity of the spatial index.


Geography vs. Geometry


An interesting thing to note about the differences between Geography and Geometry data types is that spatial indexes for Geography types do not have a bounding box specified.  This means that Geography indexes span the entire globe from -180 to 180 longitude and -90 to 90 latitude.  This means that the granularity specified for a Geography index divides the entire globe into the desired number of cells.

Geometry indexes have a bounding box specified, so the administrator can define the grid granularity in conjunction with a known data extent, meaning that the administrator has a lot more control over the equivalent "ground" size of each grid cell.

Because of this limitation I see the Geometry index as having a major advantage over the Geography index.  I'm not entirely sure of the reason for this constraint on Geography indexes, perhaps there is an assumption that data sets using geographic coordinates (longitude, latitude) are usually large scale data sets spanning large extents around the earth, but in practice I don't think this is always the case, which will become evident in my last case study.


Index Distribution


In SQL Server a primary key index has high selectivity because each indexed value is unique, which means that performance is relatively similar across the distribution of keys in the index regardless of the selection criteria.

It is a different story with non-unique indexes, because there may not be the same level of selectivity across the entire distribution of values in the index, e.g. A column containing surnames of users would undoubtedly have low selectivity when searching for names such as "smith" or "jones", but potentially have higher selectivity when searching for names such as "kerouac" or "zimmerman".  These indexes show us that the performance of the index has a high correlation with the criteria we are using to search with. 

This same consideration needs to be applied when we create spatial indexes.  When calculating the desired settings for a spatial index, not only is it important to make sure that the index is granular enough to efficiently index our records, it is also important that we have a rough idea on the range of spatial selections that users will be performing on our data set.

Based on the four levels of index grid, and three settings of granularity of each grid, it can be seen that the resulting number of cells in the grid will be the same if the index is defined as High, Medium, Medium, Medium (HMMM) as it would if the index was defined as MMMH, but the difference in performance will be dependent on the spatial criteria being applied, i.e. A large sized rectangular criteria filter may perform better using the MMMH index over the HMMM index because it may be able to select fewer grid cells in level 1 to analyse before drilling into the level 2 grid cells.


Spatial Index Tuning


I found a great source of information on spatial index tuning on Bob Beauchemin's Blog which highlights the following stored procedures for investigating spatial index characteristics:
  •  sp_help_spatial_geography_index
  •  sp_help_spatial_geometry_index

This stored procedures accept as input parameters
  • Table Name
  • Index Name
  • Flag indicating whether to return verbose output
  • A geography/geometry instance representing the selection extent (refer to the index distribution discussion above)

In Bob's Blog he discusses the interesting characteristics being the primary and internal filter efficiency, as well as the statistics on the overall number of rows in the index, broken down as an average number of index rows per row of data.

The primary filter is the spatial index, and the internal filter seems to be an optimised version of the spatial index.  The spatial filter's purpose is to reduce the number of times the query's spatial criteria is compared against records in the table, by executing it against cells in the index and looking up the data that resides in these cells.

In this post on Bob Beauchemin's Blog he discusses the output values from the tuning stored procedures, showing that the primary index alone will result in selection of a records that may meet the spatial criteria, including a number of false positives, while the optimised internal filter will perform a conservative selection of the bare minimum of records that meet the spatial criteria, and it is the difference between these selections that the query engine will use to compare the spatial criteria against the data records to determine whether they meet the query criteria.

So the most efficient spatial indexes will have a large percentage of the records being determined from the primary and internal filters (i.e. high efficiency percentages), with only a small number of surplus records needing to be checked individually.

Keep in mind that if we make the index quite granular, i.e. have a large number of small grid cells, we may be able to improve the efficiency percentage of the primary and internal filters, but the index itself will need to analyse a larger number of grid cells to see if those match the query criteria, and once the matching cells are determined, the index will possibly need to look-up the data records from more index leaf nodes.

From these observations we can see that the perfect spatial index will be defined such that the number of grid cells in the index is minimised, while still providing a granularity of the index grid to allow selections to be made across the entire data set, in such a way that the index filter efficiency will be maximised across the broad range of spatial queries being performed.

Wow, that's a mouthful, breaking that down, basically we want the index to be as efficient as possible while trying to minimise the number of rows in the index.


Index Granularity


As discussed in the Index Distribution section above, the main consideration for the spatial index settings comes down to having a good idea of the types of spatial queries that will be getting performed.  e.g. If the majority of spatial selections are for view extents, and they are mostly higher scale selections, then it might be appropriate to have an index grid density that is relatively sparse compared to the spatial density of the data in the table.  If on the other hand you are doing a lot of selections of small numbers of records, perhaps within a buffer of another feature, then you may want to have an index grid density that is fairly similar to the density of your data.

This leads to us needing to know a little more about the statistics of our data.  Typical indexes in SQL Server use statistics in their indexes to determine how selective particular data values are across the range of data in the table, so in a sense we are doing something similar in getting to know the distribution of our data to make an informed decision on the settings for our spatial index.

In my case studies below I performed spatial analysis of my data sets to determine the maximum density of data, and made decisions on my spatial indexes using that information.  I did this using a T-SQL script that splits the data extent into a number of blocks, and counts the number of records in each block looking for the maximum number.  It then calculates a relative density compared to the extent size using the number of records in the block with the maximum number.


Spatial Index Hints


Just briefly I wanted to mention that when performing my analysis of spatial indexes in SQL Server I found that I had to always define an index hint to force the spatial queries to use the index I had defined.  I have read that this may be due to the version of SQL Server I am using, but I thought it would be useful to make a note of so that anyone having problems with the spatial index not being used can be aware that this might be a cause.

The image below shows my query plan without an index hint, showing that a table scan (in this case the clustered primary key) will be performed to check each record to see if it matches the criteria expression.


The image below shows the same query with an index hint defined, showing part of the query plan that indicates a spatial index seek will be employed.


Case Studies

I have a background in cadastral surveying, so I have used three data sets that each represent a layer of cadastral land boundaries, as land parcel polygons:
  • Data Set 1 - Local Municipality - approximately 15km x  15km, with 14000 records
  • Data Set 2 - Local Municipality - approximately 40km x  40km, with 66000 records
  • Data Set 3 - State - approximately 1700km x  2200km, with 2.5 million records

For each of the data sets I created four separate selections in different locations within the data set, and then wrote a script that would execute the four selections a hundred times and then average the time taken to select the records.  The number of features selected is not exactly the same for each data set, but is relatively similar enough to make comparisons.

For each of the data sets, the scenarios that I had in mind were spatial selections of urban or semi-urban land parcels, so three of the four spatial selections in the tests select a relatively small number of records, i.e. akin to selecting the land parcels in a street, or subdivision etc.  The fourth selection is like a selection for a view extent, of a number of thousand parcels, i.e. around 1:10000 scale.


Data Set 1


This data set represents a small regional local government authority.  The extent of the data is about 15km x 15km (10 miles).  Regardless of the small size of the area, the majority of land parcels are typical residential sized allotments.

The data set was based on a projected map grid datum, so I was able to create each land parcel as a geometry data type.  This meant that I was able to define my spatial indexes with a bounding box, and be in more control of the size of the resulting index grid cells.

The data set has an average parcel size of 10000sqm, with 80% of parcels being under 2000sqm (residential land parcel size), with  the total area of those 80% of parcels being 0.64% of the area of the bounding box of the data set, and being 6.78% of the total area of all parcels.

This gives you an indication of the relative density of the data, and that the majority of the small spatial selections that will be performed on the data will be on parcels that are very small compared to the extent of the data set.

I calculated the highest area of spatial density to be 1:238400 (comparison to the size of the overall extent of the data set), which is a lot higher than the number of records in the data set (approx  14000).

I made a decision that I wanted an index grid that would be similar to the highest record density, so I figured that a grid of MLLL, or 64x16x16x16 (equating to 262144 grid cells) would be the ideal, or a permutation of those settings, i.e. LMLL, LLML, LLLM.  I also executed my test script against other index configurations as well to test my hypothesis, e.g. LLLL, HLLL, etc.

My test selections were performed with an STIntersects criteria expression with different sized rectangles in different locations in the area.  The results being:

Test 1 Count: 56 features
Test 2 Count: 124 features
Test 3 Count: 110 features
Test 4 Count: 7016 features

The results for each index are shown below, with the average time taken to perform each query, and the filter efficiency of each selection, along with the number of rows in the index.

Index Settings
Index Performance
No Index
Test 1
0.1081s
Test 2
0.1091s
Test 3
0.1080s
Test 4
0.1266s
MLLL

Total_Primary_Index_Rows        87980
Total_Primary_Index_Pages        264
Average_Number_Of_Index_Rows_Per_Base_Row        6

Test 1
 0.0049s
Internal_Filter_Efficiency        94.64%
Primary_Filter_Efficiency        67.46%
Test 2
0.0072s
Internal_Filter_Efficiency        82.25%
Primary_Filter_Efficiency        92.53%
Test 3
0.0049s
Internal_Filter_Efficiency        88.18%
Primary_Filter_Efficiency        88.70%
Test 4
0.0433s
Internal_Filter_Efficiency        88.18%
Primary_Filter_Efficiency        88.70%
LMLL

Total_Primary_Index_Rows        87955
Total_Primary_Index_Pages        263
Average_Number_Of_Index_Rows_Per_Base_Row        6

Test 1
0.0039s
Internal_Filter_Efficiency        94.64%
Primary_Filter_Efficiency        67.46%
Test 2
0.0082s
Internal_Filter_Efficiency        82.25%
Primary_Filter_Efficiency        92.53%
Test 3
0.0040s
Internal_Filter_Efficiency        88.18%
Primary_Filter_Efficiency        88.70%
Test 4
0.0417s
Internal_Filter_Efficiency        88.54%
Primary_Filter_Efficiency        98.63%
LLML

Total_Primary_Index_Rows        87020
Total_Primary_Index_Pages        261
Average_Number_Of_Index_Rows_Per_Base_Row        6




Test 1
0.0051s
Internal_Filter_Efficiency        94.64%
Primary_Filter_Efficiency        67.46%
Test 2
0.0074s
Internal_Filter_Efficiency        78.22%
Primary_Filter_Efficiency        89.20%
Test 3
0.0050s
Internal_Filter_Efficiency        87.27%
Primary_Filter_Efficiency        82.08%
Test 4
0.0416s
Internal_Filter_Efficiency        87.10%
Primary_Filter_Efficiency        98.24%
LLLL

Total_Primary_Index_Rows        53354
Total_Primary_Index_Pages        161
Average_Number_Of_Index_Rows_Per_Base_Row        3



Test 1
0.0047s
Internal_Filter_Efficiency        73.21%
Primary_Filter_Efficiency        58.33%
Test 2
0.0070s
Internal_Filter_Efficiency        79.83%
Primary_Filter_Efficiency        93.23%
Test 3
0.0042s
Internal_Filter_Efficiency        79.09%
Primary_Filter_Efficiency        69.18%
Test 4
0.0506s
Internal_Filter_Efficiency        73.07%
Primary_Filter_Efficiency        97.55%
HLLL

Total_Primary_Index_Rows        157928
Total_Primary_Index_Pages        473
Average_Number_Of_Index_Rows_Per_Base_Row        11

Test 1
0.0048s
Internal_Filter_Efficiency        100%
Primary_Filter_Efficiency        71.79%
Test 2
0.0103s
Internal_Filter_Efficiency        85.48%
Primary_Filter_Efficiency        96.12%
Test 3
0.0053s
Internal_Filter_Efficiency        90%
Primary_Filter_Efficiency        94.01%
Test 4
0.0496s
Internal_Filter_Efficiency        95.89%
Primary_Filter_Efficiency        98.40%

The results show that creating an index with a granularity equivalent to the most dense area of my data resulted in the best performance.  The LMLL index provided the best filter efficiency across all four selections, but the timed performance was slightly worse for one of the smaller selections, but this time difference is negligible in the context of the entire query time.

You can see that the HLLL index actually provided a much better primary and internal filter efficiency than the MLLL permutations, but because this index results in twice the number of rows of the MLLL indices, the performance time is longer, as I discussed in the Spatial Index Tuning section.

Conversely, you can see from the LLLL index that the number of index rows is significantly less, but the primary and internal filter efficiency is worse, because the granularity of the grid cells does not support the queries that we are performing, so the query engine is having to resort to comparing more of the data against the spatial criteria, and hence the efficiency is poor.

One thing to note about this data set is that it is so small (14000 records) that the performance with no index is still quite good, with the table scan only taking around 0.1s to check each record in the table to see if it matches the spatial criteria (this is why the time is very similar for each test).  It is worth noting though that by adding the spatial index the large selection query performance is 3 times faster, and the smaller selections are 20 times faster.


Data Set 2


This data set represents a medium sized suburban local government authority.  The extent of the data is about 40km x 40km (25 miles).  The data has a number of medium to high density suburban areas, and there are large portions within the extent of data set that are not utilised as land parcels, leading to the data set having areas of higher density parcels relative to the size of the extent of the data set.

Like the previous data set, this data was based on a projected map grid datum, so I was able to create each land parcel as a geometry data type.  This meant that I was able to define my index with a bounding box, and be in more control of the size of the resulting index grid cells.

This data set has an average parcel size of 8000sqm, with 90% of parcels being under 2000sqm, with the total area of those 90% of parcels being 2.78% of the area of the extents of the data set, and being 7.73% of total area of all parcels.

I calculated the highest area of spatial density to be 1:1189200, which is a lot higher than the number of records in the data set (approx  66000).

Again, basing my decision for the index grid size to be similar to the highest record density, I decided on a grid size of MMLL (64x64x16x16) or HLLL (256x16x16x16) to equate to 1048576 grid cells.

The spatial selection test results are shown below:

Test 1 Count: 82 features
Test 2 Count: 167 features
Test 3 Count: 116 features
Test 4 Count: 6226 features

Index Settings
Index Performance
No Index

Test 1
0.4770s
Test 2
0.4755s
Test 3
0.4784s
Test 4
0.4958s
MMLL

Total_Primary_Index_Rows        274349
Total_Primary_Index_Pages        820
Average_Number_Of_Index_Rows_Per_Base_Row        4




Test 1
0.0046s
Internal_Filter_Efficiency        50%
Primary_Filter_Efficiency        84.53%
Test 2
0.0072s
Internal_Filter_Efficiency        91.01%
Primary_Filter_Efficiency        97.66%
Test 3
0.0048s
Internal_Filter_Efficiency        77.58%
Primary_Filter_Efficiency        76.31%
Test 4
0.0417s
Internal_Filter_Efficiency        84.90%
Primary_Filter_Efficiency        96.94%
MLML

Total_Primary_Index_Rows        273834
Total_Primary_Index_Pages        818
Average_Number_Of_Index_Rows_Per_Base_Row        4



Test 1
0.0043s
Internal_Filter_Efficiency        50%
Primary_Filter_Efficiency        84.53%
Test 2
0.0079s
Internal_Filter_Efficiency        91.01%
Primary_Filter_Efficiency        97.66%
Test 3
0.0047s
Internal_Filter_Efficiency        77.58%
Primary_Filter_Efficiency        76.31%
Test 4
0.0421s
Internal_Filter_Efficiency        84.86%
Primary_Filter_Efficiency        96.90%
HLLL

Total_Primary_Index_Rows        274443
Total_Primary_Index_Pages        820
Average_Number_Of_Index_Rows_Per_Base_Row        4



Test 1
0.0051s
Internal_Filter_Efficiency        50%
Primary_Filter_Efficiency        84.53%
Test 2
0.0076s
Internal_Filter_Efficiency        91.01%
Primary_Filter_Efficiency        97.66%
Test 3
0.0053s
Internal_Filter_Efficiency        77.58%
Primary_Filter_Efficiency        76.31%
Test 4
0.0403s
Internal_Filter_Efficiency        84.90%
Primary_Filter_Efficiency        96.94%
MMML

Total_Primary_Index_Rows        489475
Total_Primary_Index_Pages        1459
Average_Number_Of_Index_Rows_Per_Base_Row        7



Test 1
0.0051s
Internal_Filter_Efficiency        92.41%
Primary_Filter_Efficiency        97.14%
Test 2
0.0100s
Internal_Filter_Efficiency        86.82%
Primary_Filter_Efficiency        95.42%
Test 3
0.0054s
Internal_Filter_Efficiency        82.92%
Primary_Filter_Efficiency        88.17%
Test 4
0.0386s
Internal_Filter_Efficiency        91.37%
Primary_Filter_Efficiency        87.87%

As we can see again, by making the index granularity similar to the density of the most dense data in the data set, we have defined the best performing spatial index - MMLL.  As I mentioned above, the grid cell granularity of MMLL is the same as HLLL.  What is interesting to note is that the primary and internal filter efficiency is exactly the same between MMLL and HLLL, and the number of index rows is very similar as well, but the time taken to perform the larger selections is slightly faster for the HLLL index.  It may be that more of the solution may be getting determined by grid levels 1 or 2 and not having to resort to lower grid levels.  When using the verbose output setting on the stored procedure, there are results such as Total_Number_Of_ObjectCells_In_Level1_In_Index that will give a breakdown of each grid level, and can be analysed to see what the object/index cell distribution is across each of the levels.

One of the interesting results in indexes MMLL and HLLL is the efficiency values in test 1.  The internal and primary filter efficiency is quite poor.  I checked this selection and found that the selection rectangle actually passed through most of the features that were being selected.  This means that a high percentage of features needed to be analysed to see if they fit the criteria.  One thing we may be able to read into this is that the granularity of the index grid may not be small enough, given than our optimal choice in the first data set resulted in an average index rows per data row of 6, and this choice was 4, perhaps we don't have the granularity we need.  Since the selection was so small, and the performance time quite fast anyway, I ignored this result as an outlier, considering it an anomaly of my selection choice.

Another interesting result was the use of the MMML index, which results in twice the number of index rows, but results in an average number of index rows per data row that is closer to our choice from data set 1 - 7.  Looking at the filter efficiency we can see that there has been a big improvement for test 1, although the performance time has not improved.  Again we can see this as the trade-off between efficiency based on grid granularity, and number of index rows that need to be analysed.  If over time we found the MMLL index being inefficient for other selections, we may need to think about using the MMML index, depending on how it was performing.

It is interesting to note with this data set having 66000 records, that the time to perform a table scan has increased significantly on the time encountered in data set 1, with indexed results being between 10 and 100 times faster.


Data Set 3


This data set represents the cadastral boundaries for the Australian state of Queensland.  For overseas readers, you may recall Queensland being in the news earlier in the year due to severe flooding in the south east, and Cyclone Yasi in the north.  Queensland is the 3rd most populated state of Australia, and 2nd in area to Western Australia.  The state is 1700km (1000 miles) wide, by 2200km (1400 miles) in length.  Most of the population is centred around the south east corner of the state (containing the vast majority of land parcels in the state), with a number of populated regional centres located along the coast.




This data set has an average parcel size of 654150sqm, with 80% of parcels being under 2000sqm, with the total area of those 90% of parcels being 0.033% of the area of the extents of the data set, and being 0.071% of total area of all parcels.

To create a spatial index that will cater for the diversity of spatial density for this data set is quite challenging, but as I mentioned in the discussion on Geography vs Geometry, I struck a limitation with this data set that constrained what I could achieve generally.  This data set is in geographic coordinates (longitude, latitude), so I am not able to define a bounding box for the spatial index.  Instead I was only able to try to make the grid as granular as I could to improve the types of selections I desired.

As with the other two data sets, I calculated the highest area of spatial density to be 1:508659200, which is a lot higher than the number of records in the data set (approx  2.5 million),  which if using geometry data I could have created a HHHM (256x256x256x64) 1073741824 grid cell index, but using this granularity for the geography index provided poor efficiency (see below).  I was only able to define one level higher in the number of grid cells - HHHH (256x256x256x256) with 4294967296 grid cells, and while this did improve upon the previous settings, was still quite poor performing in terms of efficiency.

The spatial selection test results are shown below:

Test 1 Count: 79 features
Test 2 Count: 264 features
Test 3 Count: 117 features
Test 4 Count: 6297 features

Index Settings
Index Performance
No Index

Test 1
34.8497s
Test 2
34.9102s
Test 3
34.8761s
Test 4
34.9194s
HHHL

Total_Primary_Index_Rows        3378393
Total_Primary_Index_Pages        10054
Average_Number_Of_Index_Rows_Per_Base_Row        0




Test 1
0.0397s
Internal_Filter_Efficiency        0%
Primary_Filter_Efficiency        3.97%
Test 2
0.0565s
Internal_Filter_Efficiency        0%
Primary_Filter_Efficiency        8.96%
Test 3
0.0343s
Internal_Filter_Efficiency        0%
Primary_Filter_Efficiency        6.92%
Test 4
0.3844s
Internal_Filter_Efficiency        0%
Primary_Filter_Efficiency        31.26%
HHHM

Total_Primary_Index_Rows        3882914
Total_Primary_Index_Pages        11554
Average_Number_Of_Index_Rows_Per_Base_Row        0

Test 1
0.0228s
Internal_Filter_Efficiency        0%
Primary_Filter_Efficiency        6.86%
Test 2
0.0197s
Internal_Filter_Efficiency        0%
Primary_Filter_Efficiency        29.01%
Test 3
0.0207s
Internal_Filter_Efficiency        0%
Primary_Filter_Efficiency        11.83%
Test 4
0.2183s
Internal_Filter_Efficiency        2.79%
Primary_Filter_Efficiency        49.22%
HHHH

Total_Primary_Index_Rows        4582733
Total_Primary_Index_Pages        13637
Average_Number_Of_Index_Rows_Per_Base_Row        0


Test 1
0.0162s
Internal_Filter_Efficiency        0%
Primary_Filter_Efficiency        11.70%
Test 2
0.0167s
Internal_Filter_Efficiency        0%
Primary_Filter_Efficiency        38.59%
Test 3
0.0144s
Internal_Filter_Efficiency        0%
Primary_Filter_Efficiency        19.05%
Test 4
0.1785s
Internal_Filter_Efficiency        9.30%
Primary_Filter_Efficiency        61.30%

As discussed above, the use of the Geography data type has constrained our choices with this data set, so HHHH is the natural choice for our index, based on it having the best filter efficiency, and unfortunately not really having any more granular settings at out disposal, i.e. HHHH is the most detailed index grid we can define.

When performing the selections on this data set with no index, the time taken to perform a table scan was around 35s, so the improvement in using the index was from 200 to 2000 times faster.

Unfortunately the lack of control over the granularity of the grid means that a lot of processing would be occurring in the database to perform each query, so I would expect performance to degrade quickly for this data set in a multi user environment.

While this data set was interesting to use from an analysis perspective, in an enterprise situation I would expect that the data would be tiled into logical regions, with queries incorporating one or more tiles depending on the size of the query.  In this way a tile would act as a kind of preliminary index.


Conclusion


Like any system implementation I think that applying a spatial index is something that needs to be reviewed and tweaked over the lifetime of the data set.  Depending on the type of data being kept, the data may incrementally change over time, possibly requiring tweaking of the index, but regardless of this, you will also learn more about the queries that users are performing against the data as time goes by, and this knowledge will help you design you spatial index.

Remember that your spatial index will have a profound effect on the performance of your database, especially in a multi user scenario, so the index should be designed well to meet the needs of the queries being performed.

Another suggestion that I haven't spoken of so far is that like typical SQL Server indexes, it is possible to have multiple spatial indexes on the same table, so if necessary you could have an index that is utilised by larger queries, such as view extents at a high scale, and another for granular spatial queries, and then use index hints to swap between the two when performing different operations.  This would split the processing and potentially improve the performance in a multi user environment.

The following are a summary of my main tips from what I have learned:

  • Always define a spatial index, no matter what the settings - don't get hung up on needing the perfect index from the start, as I have shown having any index improves spatial queries over no index at all.  As I said you can improve your index over time as you get to understand your users, and your data.
  • Make sure that your index is being used by checking the query plan.
  • Determine the density of your data, or at least be able to estimate the density of your desired index grid cells.
  • Get some information about the types of spatial selections users will be performing - perhaps log all user selections so that you can test your spatial indexes with these queries.
  • Use the tuning stored procedures - be objective about filter efficiency vs. number index rows, because an imbalance in either will affect your query performance, and therefore load on the server in a multi user environment.

My final comment would be that were you have the choice to use Geometry or Geography data, choose Geometry, if only for the control over the spatial index.  I see the index limitation with the Geography data type to be a major stumbling block, especially in situations such as my third case study.  I have heard that future versions of SQL Server will have the ability to have indexes with deeper grid levels, so this may alleviate the situation I have, so I look forward to seeing this new functionality.

4 comments:

  1. Amazing post, learned a lot about something i never really new about in SQL server (always something new to learn right?). Theres a nice page here too:

    SQL Index tuning tips

    ReplyDelete
  2. Hi
    Awesome post. I have a question though, I didnt get why you couldn't use geometry on the dcdb? I have to maintain a points dataset with geotagged photos in Qld, am I doing something ridiculously wrong by loading them up into a geometry feature class (they're photos geotagged from the gps in their exif)
    Regards

    ReplyDelete
    Replies
    1. Thanks for your feedback. It's been a while since I did this analysis/post so your are testing my memory a little :)

      I think one of the reasons I tested the two smaller datasets as geometry features and the large dataset as geography features was because my existing data was already in coordinate systems that fit that approach, i.e. the smaller datasets were in grid coordinates and the large was in longitude/latitude.

      In addition to this it was also an opportunity to test the differences in configuring indexes for both geometry and geography shapes. In hindsight, yes I could have done a second test of the large dataset in grid coordinates and compared the performance difference, but note it would need to have been a different projection than the smaller data sets are in (MGA56) otherwise I wouldn't have been able to include all the data, i.e. Queensland is covered by 3 zones for the MGA projection (54, 55, 56).

      So no, there is nothing that will technically prevent you from storing your longitude/latitude data as coordinates as a geometry, but you must remember that since these are polar coordinates, you would need to do your own distance calculations etc taking into considerations the curvature of the earth, i.e. the arc traced out by the great circle connecting your points etc. The other consideration is crossing the 180E/180W longitude line, but if your data is limited to Queensland only, then this wont be a problem.

      Later versions of SQL Server also have more capabilities for defining parameters of the spatial index, so even if you did use the geography type you could still define an efficient spatial index.

      Delete