More

Improve performance of a PostGIS st_dwithin query

Improve performance of a PostGIS st_dwithin query


I am doing a local statistic, similar to the the one described in an earlier question / answer:

SELECT a.tree_id, a.species, avg(b.age) as age_avg, count(*) as samples, a.geom FROM trees a LEFT JOIN trees b ON ST_DWithin(a.geom, b.geom, 100) AND a.species = b.species WHERE a.age IS NULL GROUP BY a.tree_id, a.species, a.geom;

This finds all trees of the same species in a radius of 100 meters. This works quite nicely for small datasets with few missing data points (WHERE a.age IS NULL).

However, when I run the query for a larger dataset with more missing data, it gets very slow (i.e. several hours / days). In this case 6000 of of a total of 200000 points have no value (a.age).

Do you see a way to increase the speed of the query? Maybe an alternative function tost_dwithinis helpful?

UPDATE -EXPLAIN ANALYZEsays:

HashAggregate (cost=2721694211.11… 2721694280.52 rows=6941 width=26) (actual time=12571.761… 12571.774 rows=8 loops=1)

Group Key: a.id, a.age, a.species

-> Nested Loop Left Join (cost=0.00… 2721694141.70 rows=6941 width=26) (actual time=655.504… 12570.495 rows=167 loops=1) Join Filter: ((a.geom && st_expand(b.geom, 300::double precision)) AND (b.geom && st_expand(a.geom, 300::double precision)) AND _st_dwithin(a.geom, b.geom, 300::double precision)) Rows Removed by Join Filter: 11210316

-> Seq Scan on trees a (cost=0.00… 251349.76 rows=6941 width=54) (actual time=213.037… 1006.055 rows=8 loops=1) Filter: ((age IS NULL)) Rows Removed by Filter: 1401302

-> Materialize (cost=0.00… 254946.52 rows=1438701 width=36) (actual time=0.004… 326.335 rows=1401310 loops=8)

-> Seq Scan on trees b (cost=0.00… 247753.01 rows=1438701 width=36) (actual time=0.011… 1490.796 rows=1401310 loops=1)

Planning time: 0.186 ms Execution time: 12597.034 ms


The explain doesn't show an index coming into play, which could be for two reasons:

  • You don't have one. So make one withCREATE INDEX tree_gix ON trees USING GIST (geom)
  • Your data is in geographic coordinates, so your spatial join isn't really doing anything selective (it's joining every tree to all other trees, every time). In that case, either (a) change to using thegeographytype or (b) move your data to an appropriate planar projection (I recommend (b)).

Finally, yeah, theLEFT JOINis not doing anything useful for you unless there are lonely trees with no partners in the radius you need to keep in the result set. I'd remove the geometric and speciesGROUP BYas well, since you already have a unique id in there, the other variables are just noise.

SELECT a.tree_id, a.species, avg(b.age) as age_avg, count(*) as samples, a.geom FROM trees a JOIN trees b ON ST_DWithin(a.geom, b.geom, 100) AND a.species = b.species WHERE a.age IS NULL GROUP BY a.tree_id;

Other than the first two possible errors above, the query itself looks pretty neat and clean.


Watch the video: Spatial Queries - Working with Geometries #PostGIS. #PostgreSQL. #QGIS. UrduHindi. #13