Monday, August 8, 2011

Nought to PostGIS in 6.3 seconds…almost

In a recent post I decided to investigate some technologies that were new to me, such as leaflet, node.js and express, but in addition to these I had never interacted with PostgreSQL/PostGIS before either, so setting up PostGIS was part of my learning exercise.  Being a noob when it came to PostgreSQL and PostGIS, and fairly inexperienced in linux generally, I was able to find the information I needed to get going, but I thought I would pool it together in one location so others can see what I had to do to get my spatial database and queries working in PostGIS, for any other new comers to PostGIS to learn from as well.


Linux


I did my setup on a VirtualBox installation of Ubuntu 11.04.  My aim was to spend as little time as possible mucking around with installation of software, and maximise time spent on my investigations of leaflet and node.js, so where possible I wanted to install via the published packages available in my installation of Ubuntu without having to download any source code and compile my own versions of libraries.  One of my other reasons for wanting to install known published versions was that I intended to use node.js packages to access the PostGIS database, so I wanted to know up front what the documented version compatibilities were, and be able to use the known version as a starting point for investigating why my code might not work, but in the end I didn't have any problems, which is cool :)


Install PostgreSQL


PostgreSQL was relatively easy to install using the packages in Ubuntu, which I installed from the command line, but you could also search for and install these packages in the Synaptic Package Manager.
sudo apt-get install postgresql postgresql-client postgresql-contrib pgadmin3


Install PostGIS


PostGIS is an additional install on top of PostgreSQL.  In Ubuntu 11.04 the version of PostgreSQL installed is 8.4, so the package required to install PostGIS is postgresql-8.4-postgis, which can be installed with the following command.
sudo apt-get install postgresql-8.4-postgis


Change the admin password


When the system is installed, a good practice is to set the administrative user's password.  This is done by setting the password for the postgres user in the database, but in addition to this we change the linux user account password as well, so that they are synchronised. 


Database Login
sudo su postgres -c psql postgres
postgres=# ALTER USER postgres WITH PASSWORD '<password>';
postgres=# \q


Ubuntu Login
sudo passwd -d postgres
sudo su postgres -c passwd
Now enter the same password that you used for the postgres database login above.


Remote Access


For my purposes I did not need to access the database from a remote machine because I indended to run Node.js on the same machine, but if you need to configure remote access for the database, have a look at the following page by Stuart Ellis for further information.


Create a template database


To make it easier to create new spatial databases in future, the steps below go through a procedure to create a template database from which new databases can be spawned.


Create the template database and run the PostGIS scripts to add the required database objects for PostGIS
sudo su postgres
createdb postgistemplate
createlang plpgsql postgistemplate
psql -d postgistemplate -f /usr/share/postgresql/8.4/contrib/postgis-1.5/postgis.sql
psql -d postgistemplate -f /usr/share/postgresql/8.4/contrib/postgis-1.5/spatial_ref_sys.sql
psql -d postgistemplate -f /usr/share/postgresql/8.4/contrib/postgis_comments.sql
...and then test it
psql -d postgistemplate -c "SELECT postgis_full_version();"


Create Roles/Users


The template database will be configured with a role called SpatialGroup.  Users will be assigned as members of the SpatialGroup role if they need to read or update spatial data.  The database can be opened up/locked down accordingly to give those users the bare permissions they need to the spatial data.


For now I am only using a single login called spatial, which will also be the owner of the PostGIS tables.  This is the user I connected to the database with in my previous example post.


The following commands create the role and user described above.  Make sure to choose a password for the login and replace the placeholder in the sql below.
psql -d postgres
postgres=# CREATE ROLE SpatialGroup NOSUPERUSER NOINHERIT CREATEDB NOCREATEROLE;
postgres=# CREATE ROLE spatial LOGIN PASSWORD '<password>' NOINHERIT;
postgres=# GRANT SpatialGroup TO spatial;


Assign permissions to template database


We need to assign permissions for the postgistemplate tables (geometry_columns and spatial_ref_sys) which will be owned by the spatial user.


Exit from the previous connection (type \q), and connect to the postgistemplate database as the postgres user
psql -d postgistemplate
Then assign the permissions...
postgistemplate=# ALTER TABLE geometry_columns OWNER TO spatial;
postgistemplate=# ALTER TABLE spatial_ref_sys OWNER TO spatial;
Create a schema for your spatial data (its best to store the GIS related tables in a custom schema rather than the public schema to aid future maintenance or administration of security)
postgistemplate=# CREATE SCHEMA SpatialSchema AUTHORIZATION spatial;

Exit from the connection (type \q)



Create a database from the template


Now we can create new spatial databases from the template we created
createdb -T postgistemplate -O spatial SpatialDatabase


Load Data


You can use the shp2pgsql utility to load you spatial data from your shapefile into PostGIS.  There is documentation on the command line arguments in the PostGIS documentation.  Some of the notable arguments are I, which will create an index on your table after the data is loaded, and G which will create your shapes as a geography data type rather than a geometry data type, but you must make sure that your data is already in WGS84 long/lat.


I also used the D argument which the documentation indicates will run faster for large data sets, such as the one I am working with.
shp2pgsql -I -c -G -D /media/queensland/cadastre.shp SpatialSchema.Cadastre | psql -d SpatialDatabase -U spatial 


Indexes


Earlier in the year I did a post on spatial indexes in SQL Server and how to analyse spatial data to improve spatial indexes.  It is interesting to highlight that indexes in SQL Server are grid based, which forces us to be mindful of our data distribution when configuring the index grid parameters.


Contrary to this approach, PostGIS uses an index type called GiST or Generalized Search Tree.  This index structure is used for indexing data that is too complex to be indexed using typical single dimensioned BTrees or something similar.  For GIS data the GiST index stores index topology such as other features contained within a feature, or which features overlap a feature etc.


PostGIS uses an RTree implementation of the GiST index structure to lookup GIS data.


If you need to create a spatial index on a table, and didn't set the command line argument in the shp2pgsql utility, then you can easily create a spatial index with the following command
CREATE INDEX IX_SP_state_1 ON state_1 USING GIST ( geom )
...where IX_SP_state_1 is the desired index name, state_1 is the name of the table, and geom is the name of the geometry/geography column.



Is my index being used by my query?


A problem I encountered in my example post was that my spatial query which used the ST_Intersects function in the filter expression would not utilise the spatial index, causing the query shown below to run for more than 20 seconds.  I was able to confirm that the query was not using the index by checking the query plan in pgAdmin.
If you execute the statement with the Explain Query button the will result will show the query plan, which should show whether an index will be employed on not.


In the case of the following statement
select ST_AsGeoJSON(geog)
from SpatialSchema.Cadastre
where ST_Intersects(geog, ST_GeogFromText('SRID=4326;POLYGON((152.753145 -27.611423,152.753885 -27.613172,152.757018 -27.61244,152.755065 -27.610358,152.753145 -27.611423))'));
I get the query plan



"Seq Scan on state_1  (cost=0.00..789592.03 rows=860124 width=1082)"
"  Filter: st_intersects(geog, '0103000020E61000000100000005000000AB048BC319186340AF05BD37869C3BC03D7E6FD31F186340AAD216D7F89C3BC0770FD07D3918634022C32ADEC89C3BC0562B137E291863401BDA006C409C3BC0AB048BC319186340AF05BD37869C3BC0'::geography)"


This indicates my query will perform a table scan to select the matching records.  As discussed in my earlier post, this data set represents all the cadastral boundaries in Queensland, so it is a significant table to be performing a table scan over.


I was able to improve the query by including the following expression to my filter before my ST_Intersects() function.
geog && ST_GeogFromText('SRID=4326;POLYGON((152.753145 -27.611423,152.753885 -27.613172,152.757018 -27.61244,152.755065 -27.610358,152.753145 -27.611423))')
The && operator instructs the query engine to select the features whose bounding boxes intersect the shape's bounding box.  This results in a smaller sample that can then be filtered with the ST_Interesects() filter.


The query plan for this query is shown below, indicating that it will use the index to perform an index scan.



"Index Scan using ix_sp_state_1 on state_1  (cost=0.00..9.61 rows=1 width=1082)"
"  Index Cond: (geog && '0103000020E61000000100000005000000AB048BC319186340AF05BD37869C3BC03D7E6FD31F186340AAD216D7F89C3BC0770FD07D3918634022C32ADEC89C3BC0562B137E291863401BDA006C409C3BC0AB048BC319186340AF05BD37869C3BC0'::geography)"
"  Filter: st_intersects(geog, '0103000020E61000000100000005000000AB048BC319186340AF05BD37869C3BC03D7E6FD31F186340AAD216D7F89C3BC0770FD07D3918634022C32ADEC89C3BC0562B137E291863401BDA006C409C3BC0AB048BC319186340AF05BD37869C3BC0'::geography)"


2 comments:

  1. Welcome to the wonderful world of Linux based GIS servers :-))

    I would strongly recommend to take a look at Tilemill (http://tilemill.com) It combines nodes.js with Mapnik (https://wiki.openstreetmap.org/wiki/Mapnik) rendering and would fit well with your setup.

    Tilemill MBtiles export function lacks a lot of functionality. But you can easily design a Map with Tilemill, save the Mapnik configuration and use it in combination with e.g. OGC Server (http://trac.mapnik.org/wiki/OgcServerSvn)

    ReplyDelete
  2. Thank you for this article. It helped me a lot!

    ReplyDelete