Performance Evaluation of User Defined Functions for Impala

While working with the Impala installation in our lab cluster, we discussed the possibility of having user-defined functions in the database. Luckily, with the most recent release of Impala (1.2), it is possible not only to supply user-defined functions implemented in Java from Hive, but also to use UDFs implemented directly in C++.

Cloudera provides nice documentation and a blog post on how to get started with writing a user-defined function, and the results are very promising.

However, one thing that is not immediately clear is how to supply Impala with LLVM-compiled modules instead of traditional shared object files. The advantage of LLVM IR code is that an externally linked object cannot be optimized or inlined when the execution plan for a query is generated, whereas the emitted IR code can be “pasted” into the execution plan and then take part in LLVM's usual optimization passes.

To use the IR code as a UDF, simply install clang from your repositories and run make. This will produce a .ll file containing the IR code.
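
Under the hood, the build boils down to asking clang to emit LLVM IR instead of machine code. A hand-rolled invocation would look roughly like the following (illustrative only; the source and output names are taken from the example, and the include path for the Impala UDF headers, which the Makefile takes care of, is omitted):

clang++ -S -emit-llvm udf-sample.cc -o libudfsample.ll

Now, in the Impala shell, run the following command (from the example):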

create function has_vowels (string) returns boolean location '/user/hive/udfs/libudfsample.ll' symbol='HasVowels';

Instead of supplying the SO file, simply point to the LL file and Impala will use the IR code instead of the standard object file.
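
For reference, the C++ side of such a function, modeled after Cloudera's sample, could look roughly like the sketch below. The header location depends on how the UDF SDK is installed, and the body is only illustrative:

#include <impala_udf/udf.h>  // header location depends on your UDF SDK installation  

using namespace impala_udf;  

// Sketch of a UDF matching the CREATE FUNCTION statement above:  
// takes a STRING and returns a BOOLEAN indicating whether it contains a vowel.  
BooleanVal HasVowels(FunctionContext* context, const StringVal& input) {  
  if (input.is_null) return BooleanVal::null();  
  for (int i = 0; i < input.len; ++i) {  
    switch (input.ptr[i]) {  
      case 'a': case 'e': case 'i': case 'o': case 'u':  
      case 'A': case 'E': case 'I': case 'O': case 'U':  
        return BooleanVal(true);  
      default:  
        break;  
    }  
  }  
  return BooleanVal(false);  
}  

The signature mirrors the CREATE FUNCTION statement: one STRING argument, a BOOLEAN return value, and the exported symbol name HasVowels.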

Now, custom UDFs tend to be more complex than the primitives provided by the database system itself, so the question came up: how can we track the expected performance of a user-defined function?

As a consequence, I wrote a small helper library that allows a performance evaluation of a UDF by looking at the number of instructions and the ratio of instructions to cycles, as these two metrics give a good idea of the expected performance. To use the analyzer, simply clone my Impala UDF Sample repository and check out the feature/papi_perf_eval branch.

Let’s look at the sample UDFs provided by this repository. To use the tracer, first instantiate an executor object that creates the necessary function context for the UDF. Second, open a new scope and create an instance of ScopedTracer, which takes two arguments – the name of the scope for printing and the number of times the block is executed, so that the results can be analyzed on a per-call level. Then, within a loop, execute the UDF and supply it with input values. The larger the number of iterations, the smaller the relative overhead of the loop itself.

#include "helper/papi-tracer.h"  
#include "helper/udf-execute.h"  

// ...  
impala_udf::UdfExecuteHelper executor;  
{   
  ScopedTracer sct("AddUdf", 1000u);  
  for(int i=0; i < sct.numCalls(); ++i) {  
    counter += executor.ExecuteUdf<IntVal, IntVal, IntVal>(  
      AddUdf, IntVal(i), IntVal(i+1)).val;  
  }  
}  

Once the ScopedTracer scope ends, the tracer prints a result containing the necessary information about the expected performance. Most interestingly, it tries to predict the execution time of this sample for one billion records on 3 GHz and 2 GHz machines (cycles per call × 10^9 calls ÷ clock rate, so 25 × 10^9 / 3×10^9 ≈ 8 s here), to get a feeling for the overhead of the UDF and what could be improved. As a rule of thumb, the lower the number of instructions, the better the overall performance. See a sample output below:

--------------------------------------  
Performance Analysis for AddUdf  
Real Time:                  0  
Total Instructions:        64  
Instructions / Cycle:       2.53008  
Cycles / Call:             25  
Time in s per 10^9 calls @3GhZ: 8  
Time in s per 10^9 calls @2GhZ: 12  
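
For those curious where these numbers come from: the tracer is essentially an RAII wrapper around PAPI's hardware counters. A minimal sketch of the idea, assuming PAPI's high-level counter API, could look like this (illustrative only – the ScopedTracer in the repository differs in detail, and error handling of the PAPI return codes is omitted):

#include <papi.h>  
#include <cstdio>  

// Illustrative RAII tracer around PAPI's high-level counter API.  
class ScopedTracerSketch {  
 public:  
  ScopedTracerSketch(const char* name, unsigned num_calls)  
      : name_(name), num_calls_(num_calls) {  
    int events[2] = {PAPI_TOT_INS, PAPI_TOT_CYC};  
    PAPI_start_counters(events, 2);  // start counting instructions and cycles  
  }  

  ~ScopedTracerSketch() {  
    long long counts[2];  
    PAPI_stop_counters(counts, 2);   // stop and read the counters  
    const long long instructions = counts[0];  
    const long long cycles = counts[1];  
    std::printf("Performance Analysis for %s\n", name_);  
    std::printf("Total Instructions: %lld\n", instructions / num_calls_);  
    std::printf("Instructions / Cycle: %f\n",  
                cycles > 0 ? double(instructions) / cycles : 0.0);  
    std::printf("Cycles / Call: %lld\n", cycles / num_calls_);  
  }  

  unsigned numCalls() const { return num_calls_; }  

 private:  
  const char* name_;  
  unsigned num_calls_;  
};  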

In the future, I will look at providing the same kind of analysis for user-defined aggregate functions and might add branch counting as the next most important performance metric. If you have more ideas about what kind of observations might be interesting for such a small code evaluation, or if you like this idea, please leave a comment.