ETL - Extract Trino Load - A Case for Trino as a Batch Processing Engine

Conference: OpenAI + Data Forum 2022

2022-06-22

Authors: Andrii Rosa

Summary

Trino is an out-of-the-box solution for ETL that provides the execution and resource management capabilities needed to handle queries of practically any size on a cluster with limited resources. The solution removes the streaming exchange limitation by introducing a distributed buffer between tasks, allowing each task in a query to be executed completely independently of any other. This simple yet powerful improvement reduces the amount of memory that must be available in a cluster to successfully execute a query and allows fine-grained failure recovery. The solution has been well vetted and battle-tested under high concurrency, achieving cost savings of up to 60%.

Trino was tested by overloading it with many resource-intensive queries at high concurrency, and all of them succeeded. The resource management primitives are also well tested and solid.
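
The idea of a buffer between tasks can be illustrated with a toy model. The sketch below is not Trino's implementation; the names `ExchangeBuffer`, `run_with_retries`, and `run_query` are hypothetical and exist only for illustration. It assumes that each task writes its complete output to a buffer before any downstream task reads it, so a failed downstream task can be retried on its own without re-running the upstream tasks.

```python
from collections import defaultdict


class ExchangeBuffer:
    """Hypothetical stand-in for a distributed buffer holding each task's output."""

    def __init__(self):
        self._outputs = {}

    def write(self, task_id, rows):
        # Store a copy so later retries cannot mutate already-buffered output.
        self._outputs[task_id] = list(rows)

    def read(self, task_id):
        return self._outputs[task_id]


def run_with_retries(task_id, fn, buffer, attempts, max_attempts=3):
    """Run a single task in isolation; on failure, retry only this task."""
    for _ in range(max_attempts):
        attempts[task_id] += 1
        try:
            buffer.write(task_id, fn())
            return
        except RuntimeError:
            continue  # upstream output is still in the buffer, so just try again
    raise RuntimeError(f"task {task_id} failed after {max_attempts} attempts")


def run_query():
    buffer = ExchangeBuffer()
    attempts = defaultdict(int)

    # Stage 1: two independent "scan" tasks, each writing its output to the buffer.
    partitions = {"scan-0": [1, 2, 3], "scan-1": [4, 5, 6]}
    for task_id, rows in partitions.items():
        run_with_retries(
            task_id, lambda rows=rows: [r * 10 for r in rows], buffer, attempts
        )

    # Stage 2: an aggregation task that fails on its first attempt (simulated crash).
    def flaky_sum():
        if attempts["agg-0"] == 1:
            raise RuntimeError("simulated worker crash")
        return [sum(buffer.read("scan-0")) + sum(buffer.read("scan-1"))]

    run_with_retries("agg-0", flaky_sum, buffer, attempts)
    return buffer.read("agg-0"), dict(attempts)


if __name__ == "__main__":
    result, attempts = run_query()
    print(result, attempts)
```

In this toy run the aggregation task is attempted twice while each scan task runs exactly once, which is the fine-grained recovery property the summary describes: only the failed task is repeated, and only one task's working set needs to be in memory at a time.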

Abstract

Trino is a relatively new name in the open source space; it was formerly known as PrestoSQL. Trino is well known for fast ad hoc and exploratory workloads on data lakes and heterogeneous data sources. When you want to give your data scientists the ability to query across your data landscape by joining live operational data with historical data, Trino is the state of the art. Trino and Presto were initially built to replace Hive workloads at Facebook and handled massive petabyte-scale batch workloads. Yet across the board, Trino was not being widely adopted as a batch ETL engine for these workloads. As it turns out, one of the design choices behind Trino's incredible speed was forgoing failure recovery measures in exchange for faster queries. In practice, many users want the system running the query to handle recovery from failures. The Trino community has rallied around supporting native, granular failure recovery to improve resiliency in the event of a failure. This brings Trino to a new frontier by supporting both exploratory queries and long-running workloads with failure recovery, so that engineers and analysts do not have to switch between systems to run their queries.
