securityArticles Conferences Presentations

The Unbelievable Insecurity of the Big Data Stack: An Offensive Approach to Analyzing Huge and Complex Big Data Infrastructures

Conference: Defcon 29

2021-08-01

Summary

The presentation discusses the security risks associated with big data technologies and provides recommendations for securing them.

Big data technologies like Hadoop and Spark are vulnerable to remote code execution attacks through various interfaces and ports
Data injection channels like Spark Streaming and Scoop need to be secured to prevent injection of malicious data
Recommendations include removing or blocking unused dashboards and interfaces, implementing authentication and authorization, and securing communication channels between technologies

The speaker explains how an attacker can easily inject data into a Spark Streaming application through a TCP socket, potentially crashing the application or injecting malicious data into the Hadoop file system.

Abstract

Honoring the term, the variety of technologies in the Big Data stack is hugely BIG. Many complex components in charge of transport, storing, and processing millions of records make up Big Data infrastructures. The speed at which data needs to be processed and how quickly the implemented technologies need to communicate with each other make security lag behind. Once again, complexity is the worst enemy of security. Today, when conducting a security assessment on Big Data infrastructures, there is currently no methodology for it and there are hardly any technical resources to analyze the attack vectors. On top of that, many things that are considered vulnerabilities in conventional infrastructures, or even in the Cloud, are not vulnerabilities in this stack. What is a security problem and what is not a security problem in Big Data infrastructures? That is one of the many questions that this research answers. Security professionals need to count on a methodology and acquire the necessary skills to competently analyze the security of such infrastructures. This talk presents a methodology, and new and impactful attack vectors in the four layers of the Big Data stack: Data Ingestion, Data Storage, Data Processing and Data Access. Some of the techniques that will be exposed are the remote attack of the centralized cluster configuration managed by ZooKeeper; packet crafting for remote communication with the Hadoop RPC/IPC to compromise the HDFS; development of a malicious YARN application to achieve RCE; interfering data ingestion channels as well as abusing the drivers of HDFS-based storage technologies like Hive/HBase, and platforms to query multiple data lakes as Presto. In addition, security recommendations will be provided to prevent the attacks explained. REFERENCES: I plan to release a white paper at the conference, in the white paper there will be all the references. Anyway, as the attacks are novel, the references are related to infrastructure stuff mostly, not so much about security.

Materials:

Tags: