Tuesday, April 23, 2013

Dave's Foreign Data --translating foreign-n queries

This post describes more on my efforts to modify my own foreign-table infrastructure to Postgresql's fdw facility. The first post is here.

Dfd tries to do as much work as possible on the foreign server and minimize the amount of data that is sent over the network. There are a couple of reasons to maximize the work done on the foreign server. First, the database that dfd connects to is distributed across a cluster of machines to handle queries on massive data. Since the foreign server is really a cluster and has a lot more resources than the single-machine Postgresql instance, it make sense to do as much work as possible on the cluster. Second, the foreign server is optimized for the kinds of queries that run on it while Postgresql is not.

Minimizing the amount of data going over the network is a bit harder to justify since networks are so fast these days. But in general, you still don't want to burden the network with unnecessary data. Also, Postgresql has to handle all of the data that it sees, so minimizing the data that goes over the network is generally a way to move work off of Postgresql and on to the foreign-server cluster.

Dfd uses a plugin that sits between the parsing phase and the planning phase. This plugin transforms syntax trees that contain references to foreign tables. It works on the top-level query and independently works on subqueries. Dfd divides queries (and subqueries) into three categories. First there are queries that contain no foreign tables. These are called foreign-0 queries and Dfd does not modifiy them. Then there are queries that contain just one relation and that relation is a foreign table. These are called foreign-1 queries. Finally, there are all other queries which contain multiple tables and at least one foreign query. These are called foreign-n queries.

Preparing foreign-n queries

Dfd handles foreign-n queries much like fdw handles all queries. For each foreign table t in the FROM clause, dfd extracts the set of all WHERE-clause predicates that are exportable for that table and then determines the set of columns that are importable for that table.

A predicate is exportable for t if it can be executed on the foreign server. We always want to do this if possible because it both moves work to the foreign server and also can reduce the amount of data sent over the network while it can never increase the number of rows.

Specifically, a predicate p is exportable for t if
  1. p has at least one reference to a column from t
  2. p has no reference to columns from any other relations
  3. All of the functions, operators, and literals in p can be translated into equivalent functions, operators, and literals in the foreign server.
Columns are importable if they are needed for the query. Any columns that are not needed are not requested in Dfd. Specifically, a column of t is importable if it is referenced anywhere in the query other than in exportable predicates.

Here is an example. Suppose you have a foreign table t with the following columns (c1 int, c2 int, c3 float) and dfd sees the following query:
select c1 from t, local where c2=0 and c3=x
In this query, local is a local (not foreign) Postgresql table and x is one column of the table. The only exportable predicate in this query is c2=0. The other predicate, c3=x is not exportable because it contains a column from local. The importable columns are c1, because it appears in the projection list and c3 because it appears in a non-exportable predicate.

Lets call the list of exportable predicates Exp and the list of importable columns Imp. Dfd constructs a list of the columns of Imp along with their types. Lets call this list Sig. In the above example we get the following lists:
Exp = (c2=0)
Imp = (c1,c3)
Sig = (c1 intc3 float)
Dfd then replaces the reference to t with a function call to execute a foreign query. If the name of the foreign server is f, then the call looks like this:
dfd_call('f', 'SELECT Imp FROM t WHERE Exp') AS tmp(Sig)
In the above example the output table expression would be
dfd_call('f', 'select c1, c3 from t where c2=0')
  as tmp(c1 int, c3 float)
The query was compiled with the original columns of the table, (c1 int, c2 int, c3 float), but it's actually going to see just the columns (c1 int, c3 float). This means that the query thinks c3 is the third element in the result tuple, but there are now only two elements in the tuple. To fix this, dfd goes through the query and updates the references to all importable columns to be correct with the new tuple.

This last step is what I don't know if I can do with the Postgresql fdw facility. Fdw works at the planning level so the references to columns are no longer in just one convenient parse tree. They've been scattered all over the plan.

more here.

No comments:

Post a Comment