
Solving Query Optimization in Presto

Jason Li
Sr. Software Development Engineer
Skilled Angular and .NET developer, team leader for a healthcare insurance company.
December 11, 2020


Presto

Presto, an open-source distributed SQL (Structured Query Language) query engine, runs interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Presto was designed and written from scratch for interactive analytics.

What does Presto do?

Presto allows querying data where it lives, including in Cassandra, Hive, relational databases, and proprietary data stores. A single Presto query can combine data from multiple sources, allowing analytics across an entire organization.
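For instance, a single statement can join tables that live in different catalogs backed by different systems. The catalog, schema, and table names below are illustrative, not from any particular deployment, but this is the kind of federated query Presto is designed to run:

```sql
-- Hypothetical federated query: join Hive event data with a MySQL dimension table.
-- Catalog/schema/table names (hive.web.page_views, mysql.crm.customers) are illustrative.
SELECT c.region, count(*) AS views
FROM hive.web.page_views AS v
JOIN mysql.crm.customers AS c
  ON v.customer_id = c.customer_id
WHERE v.view_date >= DATE '2020-12-01'
GROUP BY c.region
ORDER BY views DESC;
```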

Presto is targeted at analysts who expect response times ranging from sub-second to minutes. It breaks the false choice between fast analytics on an expensive commercial solution and a slow solution that requires excessive hardware.

Who uses Presto?

Notable Presto users include Facebook, Airbnb, and Dropbox. At Facebook, Presto is used for interactive queries against several internal data stores.

Solving Query Optimization in Presto

Query optimization in Presto can become smarter and more efficient over repeated use by combining machine learning with adaptive query execution. "SQL on everything" is the tagline associated with Presto: the query engine was developed at Facebook to rapidly analyze massive amounts of data scattered across multiple formats and sources.

A decade ago, the traditional approach for an organization to handle its data processing needs was to set up a data center, stock it with CPUs and hard drives, and acquire software to ingest, store, and analyze the data. The whole process required investment in software licenses and associated service contracts.

Data services tend to be used in bursts, for example at the beginning of a week or the end of a quarter, when they handle far more traffic than at other times. These resources are statically allocated: provisioned for peak usage and left under-utilized the rest of the time. Organizations also need engineering teams to keep the setup operational, ensure high availability, and troubleshoot problems.

Presto works well in an auto-scaling context, which is seeing increased adoption as more enterprises move their data onto the cloud. Its extensible, federated design allows it to read and process data seamlessly from disparate sources and file formats.

Presto's federated architecture is beneficial because it can process data in place, but it also introduces significant complexity when generating an optimal execution plan for a query.

Evolution of the Query Optimizer

The evolution of the query optimizer follows a path that describes a generic problem and some of the solutions developed for it. A query optimizer converts SQL, which is expressed declaratively, into an efficient sequence of operations for the engine to perform on the underlying data. This makes query optimizers a critical component of databases.

The input to the query optimizer is a logical plan, produced by parsing the input SQL and converting it into a high-level collection of the operations required to execute the query. The query optimizer then converts the logical plan into an efficient execution strategy based on the available operators, taking the characteristics of the data layout into consideration.

For a typical SQL query there is one logical plan, but many strategies can be devised to implement and execute that plan and arrive at the same results. The query optimizer is responsible for choosing the best execution plan among them.
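In Presto you can inspect both views of a query with EXPLAIN. The sketch below assumes the built-in TPC-H connector is configured as the "tpch" catalog; the exact plan output varies by version:

```sql
-- Print the optimizer's logical plan for a query.
EXPLAIN (TYPE LOGICAL)
SELECT orderpriority, count(*) AS num_orders
FROM tpch.tiny.orders
GROUP BY orderpriority;

-- Print the distributed plan, i.e. the execution strategy the optimizer chose.
EXPLAIN (TYPE DISTRIBUTED)
SELECT orderpriority, count(*) AS num_orders
FROM tpch.tiny.orders
GROUP BY orderpriority;
```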

Most query optimizers use a set of heuristic rules to curtail the search space. The main goals include:

• Minimizing the data read from disk.

• Reducing network transfer of data.

The ultimate aim: fast query execution using fewer resources.

Presto's query optimizer applies a set of rules that mutate the logical plan until a fixed point is reached.


A Concrete Example

The following example is taken from the suite of ad-hoc queries that make up TPC-H, a commonly used decision-support benchmark.

This particular query performs a three-way join between the customer, orders, and lineitem tables. It narrows the result set by applying filters on market segment, order date, and ship date. It then calculates an aggregate sum, grouping on each distinct combination of order key, order date, and shipping priority, and finally sorts the result set in descending order of the computed revenue column and by order date.
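For reference, this is roughly TPC-H query 3; the market segment and dates shown are the benchmark's example substitution parameters, and the column names follow the TPC-H schema:

```sql
SELECT
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) AS revenue,
    o_orderdate,
    o_shippriority
FROM customer, orders, lineitem
WHERE c_mktsegment = 'BUILDING'
  AND c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < DATE '1995-03-15'
  AND l_shipdate > DATE '1995-03-15'
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue DESC, o_orderdate;
```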

The Naive Approach

A naive approach executes the query by performing a full cross join of the three tables, forming a Cartesian product, and then eliminating from this set all tuples that do not satisfy the filters in the WHERE clause. An aggregation is performed by identifying each unique combination of order key, order date, and shipping priority, and the sum of the extended price less the discount is calculated for each group. Finally, the result is ordered by the computed revenue and the order date. This sequence of operations is guaranteed to produce correct results, but it does not work for even a moderately sized dataset on most hardware.
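Conceptually, the naive plan evaluates the query as if it were written with an explicit Cartesian product and filtered only afterwards. This is a sketch of that evaluation order, not something you would run on real data:

```sql
-- Naive evaluation order (conceptual only):
-- 1. CROSS JOIN customer x orders x lineitem  -> full Cartesian product
-- 2. Filter the product by the WHERE predicates
-- 3. Aggregate per (l_orderkey, o_orderdate, o_shippriority)
-- 4. Sort by revenue DESC, o_orderdate
SELECT l_orderkey,
       sum(l_extendedprice * (1 - l_discount)) AS revenue,
       o_orderdate, o_shippriority
FROM customer CROSS JOIN orders CROSS JOIN lineitem
WHERE c_mktsegment = 'BUILDING'
  AND c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate < DATE '1995-03-15'
  AND l_shipdate > DATE '1995-03-15'
GROUP BY l_orderkey, o_orderdate, o_shippriority
ORDER BY revenue DESC, o_orderdate;
```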

A Rule-Based Optimizer (RBO)

A rule-based optimizer (RBO) framework mitigates most of the problems of the naive approach. An RBO generates a plan in which the predicates are applied while the data is being read, which reduces the size of the intermediate result sets by several orders of magnitude. Rather than a Cartesian product of all three tables, the RBO first joins customer and orders, retaining only those tuples that match on the customer key, and then joins this intermediate result with the lineitem table.
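A sketch of the plan the RBO arrives at, written as explicit joins with the filters pushed down to the table scans (this is equivalent to the original query, just reordered):

```sql
-- Filters are applied as each table is read; the joins are explicit and ordered.
SELECT l.l_orderkey,
       sum(l.l_extendedprice * (1 - l.l_discount)) AS revenue,
       o.o_orderdate, o.o_shippriority
FROM (SELECT c_custkey FROM customer
      WHERE c_mktsegment = 'BUILDING') AS c
JOIN (SELECT o_orderkey, o_custkey, o_orderdate, o_shippriority FROM orders
      WHERE o_orderdate < DATE '1995-03-15') AS o
  ON c.c_custkey = o.o_custkey
JOIN (SELECT l_orderkey, l_extendedprice, l_discount FROM lineitem
      WHERE l_shipdate > DATE '1995-03-15') AS l
  ON o.o_orderkey = l.l_orderkey
GROUP BY l.l_orderkey, o.o_orderdate, o.o_shippriority
ORDER BY revenue DESC, o.o_orderdate;
```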

Two advantages of the RBO approach

• Greatly reduced memory is required to compute the join, because filters are applied aggressively to discard tuples that are not of interest.

• Efficient algorithms can be used to process the join.

A Cost-Based Optimizer (CBO)

A cost-based optimizer (CBO) is the next step after the RBO in the evolution of query optimizers, and it starts from the fixed point reached by the RBO. A CBO relies on the engine to provide it with information about the selectivity of the predicates, and uses that information to estimate the sizes of all the join inputs: once the filters are applied, the number of records flowing into a join may be far smaller than the raw table sizes suggest.

The CBO propagates these estimates through the plan across operations such as joins and aggregations, and uses them to make intelligent choices in other parts of the query plan.
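In recent Presto versions, cost-based decisions such as join reordering and the choice of join distribution can be enabled per session. The property names below follow the Presto documentation, though defaults and availability vary by version:

```sql
-- Let the optimizer reorder joins and choose broadcast vs. partitioned joins
-- based on table statistics.
SET SESSION join_reordering_strategy = 'AUTOMATIC';
SET SESSION join_distribution_type = 'AUTOMATIC';
```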

Query Optimization in Presto Compared with Traditional Databases

Cost-based optimization is harder to realize in Presto than in a traditional database. Traditional databases do not decouple storage and compute, and they do not interface with data sources other than their own, so they can analyze and store reliable statistics about their data. Presto works with data that lives in disparate systems and is often several orders of magnitude larger, and as a data federation engine it interfaces with multiple data stores of varying capabilities.
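Presto therefore relies on connectors to supply table statistics. For connectors that support it (the Hive connector, for example), statistics can be collected and inspected roughly as follows; the catalog, schema, and table names are illustrative:

```sql
-- Collect statistics for a table through a connector that supports ANALYZE.
ANALYZE hive.sales.orders;

-- Inspect the statistics the optimizer will use
-- (row counts, null fractions, number of distinct values, and so on).
SHOW STATS FOR hive.sales.orders;
```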

The Smarter Query Optimization Plan for Presto

Adaptive Query Execution (AQE) helps solve this problem without requiring a federated cost model for the SQL query engine. The AQE paradigm removes the strict architectural separation between query planning and query execution: it is a framework in which the execution plan adapts as the engine monitors the data being processed, changing shape depending on the characteristics of the data flowing through the operators in the plan.

AQE takes the query plan produced by heuristic rule-based optimization, then re-orders operators and re-plans sub-queries based on runtime performance. The query plan is partitioned into sub-plans that can be altered independently.

During query execution, the system monitors key performance indicators (KPIs) to detect sub-optimal sub-plans and dynamically re-plans those fragments. In this way, AQE helps Presto work efficiently while optimizing queries as they run. These are the main approaches to solving query optimization in Presto.

Conclusion

A comparison between the various query optimizer approaches and Presto shows the strengths of this SQL query engine and why so many enterprises and organizations are interested in it.