Apache Spark versions 3.2 and higher provide direct encryption capabilities for sensitive data sets. By setting a handful of configuration parameters and DataFrame write options, Apache Parquet's modular encryption mechanism can be activated to encrypt selected columns, each with its own column-specific key. Furthermore, the upcoming Spark 3.4 release will introduce support for uniform encryption, in which all DataFrame columns are encrypted with the same key.
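As a rough illustration (not part of the session abstract itself), a minimal column-encryption setup in PySpark might look like the sketch below. The key values and output path are placeholders, and the InMemoryKMS client is a mock shipped for experimentation only; a production deployment replaces it with a real KMS client, as discussed later.

```python
# Hedged sketch of Parquet modular encryption in Spark, assuming Spark 3.2+
# with the parquet-hadoop encryption classes available on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("encryption-hello-world").getOrCreate()

# Activate the Parquet encryption factory and a mock in-memory KMS.
# The mock KMS is for experimentation only, never for production.
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("parquet.crypto.factory.class",
          "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hconf.set("parquet.encryption.kms.client.class",
          "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")
# Mock master keys: "<key name>:<base64-encoded 16-byte key>" (placeholders).
hconf.set("parquet.encryption.key.list",
          "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")

df = spark.createDataFrame([(1, "alice", 100.0)], ["id", "name", "salary"])

# Encrypt the 'name' and 'salary' columns with keyA; protect the file
# footer (and the remaining metadata) with keyB.
df.write \
    .option("parquet.encryption.column.keys", "keyA:name,salary") \
    .option("parquet.encryption.footer.key", "keyB") \
    .mode("overwrite") \
    .parquet("/tmp/table.parquet.encrypted")

# Reading back is transparent as long as the keys are accessible.
spark.read.parquet("/tmp/table.parquet.encrypted").show()
```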
Many companies already leverage Spark data encryption to safeguard personal or confidential business data in their production environments. Integration efforts focus primarily on key access control and on developing Spark/Parquet plug-in code that interacts with the organization's key management service (KMS).
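To give a flavor of what that integration point looks like: the plug-in is a JVM class implementing the Parquet `org.apache.parquet.crypto.keytools.KmsClient` interface (its `initialize`, `wrapKey`, and `unwrapKey` methods talk to the organization's KMS), and Spark is pointed at it through configuration. In the sketch below, `com.example.MyOrgKmsClient` and the instance ID/URL are hypothetical values; the class must be present on the Spark classpath.

```python
# Sketch: wiring Spark/Parquet encryption to an organization's own KMS.
# "com.example.MyOrgKmsClient" is a hypothetical implementation of the
# org.apache.parquet.crypto.keytools.KmsClient interface.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kms-integration").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

hconf.set("parquet.crypto.factory.class",
          "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hconf.set("parquet.encryption.kms.client.class", "com.example.MyOrgKmsClient")
# Instance ID and URL are passed through to the client's initialize() call;
# the values below are placeholders for a real deployment's endpoints.
hconf.set("parquet.encryption.kms.instance.id", "prod-kms")
hconf.set("parquet.encryption.kms.instance.url", "https://kms.example.com")
```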
In this session, we will provide an overview of Spark/Parquet encryption usage and delve into the intricacies of encryption key management, to facilitate integrating this data protection mechanism into your deployment. Participants will learn how to run a HelloWorld encryption sample and expand it into real-world production code that integrates with their organization's KMS and access control policies. Topics will include the standard envelope encryption approach for big data protection, the performance and security trade-offs of single versus double envelope wrapping, and internal versus external storage of key metadata. Additionally, a demo will be presented, and new features such as uniform encryption and two-tier management of encryption keys will be discussed.
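The key-management knobs mentioned above surface as ordinary configuration properties. The following hedged sketch shows one plausible way they are set (key names and paths are placeholders; uniform encryption assumes the upcoming Spark 3.4 / Parquet 1.13 line):

```python
# Sketch of the key-management options discussed in the session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("key-management-options").getOrCreate()
hconf = spark.sparkContext._jsc.hadoopConfiguration()

# Envelope-wrapping trade-off: double wrapping (the default) wraps data keys
# with cached key-encryption keys, reducing KMS calls; setting it to "false"
# switches to single wrapping, where every data key goes to the KMS directly.
hconf.set("parquet.encryption.double.wrapping", "false")

# Key-material placement: stored inside the Parquet files by default ("true"),
# or in separate key-material files next to the data when set to "false".
hconf.set("parquet.encryption.key.material.store.internally", "false")

# Uniform encryption: one key ("keyA" is a placeholder) for all columns
# and the footer, instead of a per-column key mapping.
df = spark.createDataFrame([(1, "alice")], ["id", "name"])
df.write \
    .option("parquet.encryption.uniform.key", "keyA") \
    .mode("overwrite") \
    .parquet("/tmp/uniform.parquet.encrypted")
```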
By the end of the session, attendees will have gained a comprehensive understanding of Spark/Parquet encryption, including its usage, key management considerations, and practical implementation in production environments. This knowledge will empower organizations to effectively protect their data assets while ensuring compliance with security and access control requirements.