Wednesday, April 24, 2013

Dave's Foreign Data --translating foreign-1 queries

As I described in a previous post, foreign-1 queries are queries that contain just one table, it being a foreign table. Foreign-1 queries allow for much more optimization than do foreign-n queries. With foreign-1 queries, dfd potentially sends all of the work to the foreign database.

Suppose there is a simple foreign table t(c1 int, c2 text) and dfd sees the query
select c2, sum(c1) from t where c1 > 0
  group by c2 having c2 like 'foo%'
If this were optimized like foreign-n queries then the following query would be sent to the foreign server:
select c2, c1 from t where c1 > 0
and this would bring back over the network all of the rows in the table where c1 > 0. But the entire query can be sent to the foreign server to execute the GROUP BY and aggregation on the cluster and only send back the groups and sums.

It isn't always possible to send the whole query, so dfd breaks down queries into a sequence of stages:

  1. projection list and WHERE clause
  2. aggregation and GROUP BY
  3. HAVING
  4. LIMIT
  5. OFFSET
  6. ORDER BY

Dfd will send the longest initial sequence of stages that the foreign server can support. If at any stage, it cannot handle the whole stage, then it does as much of that stage as it can and then bails on further optimization. For example, in the query
select c2, sum(c1) from t where c1&1=1
  group by c2 having c2 like 'foo%' order by 1
The foreign server does not have the bitwise-& operator so it cannot send the entire projection list and WHERE clause to the foreign server. Since it cannot do a complete job at stage one, it doesn't even try to optimize further stages. Dfd will send the following query to the remote server:
select c2, c1 from t
and let the GROUP BY, HAVING, and ORDER BY all be executed locally. If the query had been
select c2, sum(c1) from t where c1>0
  group by c2 having c2<'fff' and c2~='f*g*' order by 1
where the foreign server can handle the projection list and WHERE clause as well as the aggregation and GROUP BY, but not the entire HAVING clause, it would send the query as follows
select c2, sum(c1) from t where c1>0
  group by c2 having c2<'fff'
and give up on the rest of the HAVING clause and the ORDER BY clause.

The foreign server does not handle OFFSET at all, so basically if it can handle everything up to stage 4 and then it sees an OFFSET, it has to bail on the OFFSET and ORDER BY. Actually, it's a bit more complicated than that. There are some interesting relationship between LIMIT, OFFSET, and ORDER BY that dfd exploits to get the most work from the foreign cluster.

Clearly, fdw cannot handle these foreign-1 optimizations because fdw works at the level of a single plan node while these foreign-1 optimizations all completely rework the plan. This sort of optimizations has to be handled as a query rewrite, which happens before the planning phase.

For a discussion how SQL is generated and how subqueries are optimized, go here.

No comments:

Post a Comment