
Implement column-level encryption to protect sensitive data in Amazon Redshift with AWS Glue and AWS Lambda user-defined functions


Amazon Redshift is a massively parallel processing (MPP), fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools.

When businesses are modernizing their data warehousing solutions to Amazon Redshift, implementing additional data protection mechanisms for sensitive data, such as personally identifiable information (PII) or protected health information (PHI), is a common requirement, especially for those in highly regulated industries with strict data security and privacy mandates. Amazon Redshift provides role-based access control, row-level security, column-level security, and dynamic data masking, along with other database security features, to enable organizations to enforce fine-grained data security.

Security-sensitive applications often require column-level (or field-level) encryption to enforce fine-grained protection of sensitive data on top of the default server-side encryption (namely data encryption at rest). In other words, sensitive data should always be encrypted on disk and remain encrypted in memory until users with the proper permissions request to decrypt it. Column-level encryption provides an additional layer of security to protect your sensitive data throughout system processing so that only certain users or applications can access it. This encryption ensures that only authorized principals that need the data, and have the required credentials to decrypt it, are able to do so.

In this post, we demonstrate how to implement your own column-level encryption mechanism in Amazon Redshift using AWS Glue to encrypt sensitive data before loading it into Amazon Redshift, and using AWS Lambda as a user-defined function (UDF) in Amazon Redshift to decrypt the data using standard SQL statements. Lambda UDFs can be written in any of the programming languages supported by Lambda, such as Java, Go, PowerShell, Node.js, C#, Python, Ruby, or a custom runtime. You can use Lambda UDFs in any SQL statement such as SELECT, UPDATE, INSERT, or DELETE, and in any clause of the SQL statements where scalar functions are allowed.
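
For illustration, once registered, a Lambda UDF is invoked like any other scalar function. The function and table names in the following snippet are hypothetical; the actual UDF used in this post is created in a later step:

SELECT order_id, mask_pii(email_address) AS masked_email
FROM customer_orders
WHERE mask_pii(email_address) IS NOT NULL;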

Solution overview

The following diagram illustrates the solution architecture.

Architecture Diagram

To illustrate how to set up this architecture, we walk you through the following steps:

  1. We upload a sample data file containing synthetic PII data to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. A sample 256-bit data encryption key is generated and securely stored using AWS Secrets Manager.
  3. An AWS Glue job reads the data file from the S3 bucket, retrieves the data encryption key from Secrets Manager, performs data encryption for the PII columns, and loads the processed dataset into an Amazon Redshift table.
  4. We create a Lambda function to reference the same data encryption key from Secrets Manager, and implement data decryption logic for the received payload data.
  5. The Lambda function is registered as a Lambda UDF with a proper AWS Identity and Access Management (IAM) role that the Amazon Redshift cluster is authorized to assume.
  6. We can validate the data decryption functionality by issuing sample queries using Amazon Redshift Query Editor v2.0. You may optionally choose to test it with your own SQL client or business intelligence tools.

Prerequisites

To deploy the solution, make sure to complete the following prerequisites:

  • Have an AWS account. For this post, you configure the required AWS resources using AWS CloudFormation in the us-east-2 Region.
  • Have an IAM user with permissions to manage AWS resources including Amazon S3, AWS Glue, Amazon Redshift, Secrets Manager, Lambda, and AWS Cloud9.

Deploy the solution using AWS CloudFormation

Provision the required AWS resources using a CloudFormation template by completing the following steps:

  1. Sign in to your AWS account.
  2. Choose Launch Stack:
    Launch Button
  3. Navigate to an AWS Region (for example, us-east-2).
  4. For Stack name, enter a name for the stack or leave as default (aws-blog-redshift-column-level-encryption).
  5. For RedshiftMasterUsername, enter a user name for the admin user account of the Amazon Redshift cluster or leave as default (master).
  6. For RedshiftMasterUserPassword, enter a strong password for the admin user account of the Amazon Redshift cluster.
  7. Select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Create stack.
    Create CloudFormation stack

The CloudFormation stack creation process takes around 5–10 minutes to complete.

  9. When the stack creation is complete, on the stack Outputs tab, record the values of the following keys:
    1. AWSCloud9IDE
    2. AmazonS3BucketForDataUpload
    3. IAMRoleForRedshiftLambdaUDF
    4. LambdaFunctionName

CloudFormation stack output
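
If you prefer the command line, you can also retrieve the same output values from a terminal, assuming the default stack name:

aws cloudformation describe-stacks \
  --stack-name aws-blog-redshift-column-level-encryption \
  --query "Stacks[0].Outputs" \
  --output table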

Upload the sample data file to Amazon S3

To test the column-level encryption capability, you can download the sample synthetic data generated by Mockaroo. The sample dataset contains synthetic PII and sensitive fields such as phone number, email address, and credit card number. In this post, we demonstrate how to encrypt the credit card number field, but you can apply the same method to other PII fields according to your own requirements.

Sample synthetic data

An AWS Cloud9 instance is provisioned for you during the CloudFormation stack setup. You can access the instance from the AWS Cloud9 console, or by visiting the URL obtained from the CloudFormation stack output with the key AWSCloud9IDE.

CloudFormation stack output for AWSCloud9IDE

On the AWS Cloud9 terminal, copy the sample dataset to your S3 bucket by running the following commands:

S3_BUCKET=$(aws s3 ls| awk '{print $3}'| grep awsblog-pii-data-input-)
aws s3 cp s3://aws-blogs-artifacts-public/artifacts/BDB-2274/pii-sample-dataset.csv s3://$S3_BUCKET/

Upload sample dataset to S3
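
You can optionally confirm that the file landed in the bucket before moving on (the S3_BUCKET variable is still set from the previous command):

aws s3 ls s3://$S3_BUCKET/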

Generate a secret and secure it using Secrets Manager

We generate a 256-bit secret to be used as the data encryption key. Complete the following steps:

  1. Create a new file in the AWS Cloud9 environment.
    Create new file in Cloud9
  2. Enter the following code snippet. We use the cryptography package to create a secret, and use the AWS SDK for Python (Boto3) to securely store the secret value with Secrets Manager:
    from cryptography.fernet import Fernet
    import boto3
    import base64
    
    key = Fernet.generate_key()
    client = boto3.client('secretsmanager')
    
    response = client.create_secret(
        Name="data-encryption-key",
        SecretBinary=base64.urlsafe_b64decode(key)
    )
    
    print(response['ARN'])

  3. Save the file with the file name generate_secret.py (or any desired name ending with .py).
    Save file in Cloud9
  4. Install the required packages by running the following pip install commands in the terminal:
    pip install --user boto3
    pip install --user cryptography

  5. Run the Python script via the following command to generate the secret:
    python generate_secret.py

    Run Python script
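
The script prints the ARN of the new secret. You can optionally confirm that the secret named data-encryption-key was stored by describing it from the terminal:

aws secretsmanager describe-secret --secret-id data-encryption-key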

Create a target table in Amazon Redshift

A single-node Amazon Redshift cluster is provisioned for you during the CloudFormation stack setup. To create the target table for storing the dataset with encrypted PII columns, complete the following steps:

  1. On the Amazon Redshift console, navigate to the list of provisioned clusters, and choose your cluster.
    Amazon Redshift console
  2. To connect to the cluster, on the Query data drop-down menu, choose Query in query editor v2.
    Connect with Query Editor v2
  3. If this is the first time you’re using the Amazon Redshift Query Editor V2, accept the default setting by choosing Configure account.
    Configure account
  4. To connect to the cluster, choose the cluster name.
    Connect to Amazon Redshift cluster
  5. For Database, enter demodb.
  6. For User name, enter master.
  7. For Password, enter your password.

You may need to change the user name and password according to your CloudFormation settings.

  8. Choose Create connection.
    Create Amazon Redshift connection
  9. In the query editor, run the following DDL command to create a table named pii_table:
    CREATE TABLE pii_table(
      id BIGINT,
      full_name VARCHAR(50),
      gender VARCHAR(10),
      job_title VARCHAR(50),
      spoken_language VARCHAR(50),
      contact_phone_number VARCHAR(20),
      email_address VARCHAR(50),
      registered_credit_card VARCHAR(50)
    );

We recommend using the smallest possible column size as a best practice, and you may need to modify these table definitions per your specific use case. Creating columns much larger than necessary will have an impact on the size of data tables and affect query performance.

Create Amazon Redshift table

Create the source and destination Data Catalog tables in AWS Glue

The CloudFormation stack provisioned two AWS Glue data crawlers: one for the Amazon S3 data source and one for the Amazon Redshift data source. To run the crawlers, complete the following steps:

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
    AWS Glue Crawlers
  2. Select the crawler named glue-s3-crawler, then choose Run crawler to trigger the crawler job.
    Run Amazon S3 crawler job
  3. Select the crawler named glue-redshift-crawler, then choose Run crawler.
    Run Amazon Redshift crawler job

When the crawlers are complete, navigate to the Tables page to verify your results. You should see two tables registered under the demodb database.

AWS Glue database tables
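
As an alternative to the console check, you can list the tables registered in the demodb database from the AWS Cloud9 terminal:

aws glue get-tables --database-name demodb --query "TableList[].Name"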

Author an AWS Glue ETL job to perform data encryption

An AWS Glue job is provisioned for you as part of the CloudFormation stack setup, but the extract, transform, and load (ETL) script has not been created. We create and upload the ETL script to the /glue-script folder under the provisioned S3 bucket in order to run the AWS Glue job.

  1. Return to your AWS Cloud9 environment either via the AWS Cloud9 console, or by visiting the URL obtained from the CloudFormation stack output with the key AWSCloud9IDE.
    CloudFormation stack output for AWSCloud9IDE

We use the Miscreant package to implement deterministic encryption using the AES-SIV encryption algorithm, which means that for any given plaintext value, the generated encrypted value will always be the same. The benefit of using this encryption approach is to allow for point lookups, equality joins, grouping, and indexing on encrypted columns. However, you should also be aware of the potential security implications when applying deterministic encryption to low-cardinality data, such as gender, Boolean values, and status flags.
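
The following minimal sketch, separate from the ETL job itself, illustrates this deterministic property after installing the package (pip install --user miscreant). It generates a throwaway key for the demo rather than using the key stored in Secrets Manager:

import base64
from miscreant.aes.siv import SIV

demo_key = SIV.generate_key()  # throwaway key for illustration only; the ETL job reuses the Secrets Manager key
siv = SIV(demo_key)

# sealing the same plaintext twice with the same key yields identical ciphertext
ct1 = siv.seal("4111111111111111".encode())
ct2 = siv.seal("4111111111111111".encode())
print(base64.b64encode(ct1) == base64.b64encode(ct2))  # True

# the ciphertext can be reversed with the same key
print(siv.open(ct1).decode())  # 4111111111111111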

  2. Create a new file in the AWS Cloud9 environment and enter the following code snippet:
    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.dynamicframe import DynamicFrameCollection
    from awsglue.dynamicframe import DynamicFrame
    
    import boto3
    import base64
    from miscreant.aes.siv import SIV
    from pyspark.sql.functions import udf, col
    from pyspark.sql.types import StringType
    
    args = getResolvedOptions(sys.argv, ["JOB_NAME", "SecretName", "InputTable"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)
    
    # retrieve the data encryption key from Secrets Manager
    secret_name = args["SecretName"]
    
    sm_client = boto3.client('secretsmanager')
    get_secret_value_response = sm_client.get_secret_value(SecretId = secret_name)
    data_encryption_key = get_secret_value_response['SecretBinary']
    siv = SIV(data_encryption_key)  # Without nonce, the encryption becomes deterministic
    
    # define the data encryption function
    def pii_encrypt(value):
        if value is None:
            value = ""
        ciphertext = siv.seal(value.encode())
        return base64.b64encode(ciphertext).decode('utf-8')
    
    # register the data encryption function as a Spark SQL UDF
    udf_pii_encrypt = udf(lambda z: pii_encrypt(z), StringType())
    
    # define the Glue Custom Transform function
    def Encrypt_PII (glueContext, dfc) -> DynamicFrameCollection:
        newdf = dfc.select(list(dfc.keys())[0]).toDF()
        
        # PII fields to be encrypted
        pii_col_list = ["registered_credit_card"]
    
        for pii_col_name in pii_col_list:
            newdf = newdf.withColumn(pii_col_name, udf_pii_encrypt(col(pii_col_name)))
    
        encrypteddyc = DynamicFrame.fromDF(newdf, glueContext, "encrypted_data")
        return (DynamicFrameCollection({"CustomTransform0": encrypteddyc}, glueContext))
    
    # Script generated for node S3 bucket
    S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(
        database="demodb",
        table_name=args["InputTable"],
        transformation_ctx="S3bucket_node1",
    )
    
    # Script generated for node ApplyMapping
    ApplyMapping_node2 = ApplyMapping.apply(
        frame=S3bucket_node1,
        mappings=[
            ("id", "long", "id", "long"),
            ("full_name", "string", "full_name", "string"),
            ("gender", "string", "gender", "string"),
            ("job_title", "string", "job_title", "string"),
            ("spoken_language", "string", "spoken_language", "string"),
            ("contact_phone_number", "string", "contact_phone_number", "string"),
            ("email_address", "string", "email_address", "string"),
            ("registered_credit_card", "long", "registered_credit_card", "string"),
        ],
        transformation_ctx="ApplyMapping_node2",
    )
    
    # Custom Transform
    Customtransform_node = Encrypt_PII(glueContext, DynamicFrameCollection({"ApplyMapping_node2": ApplyMapping_node2}, glueContext))
    
    # Select the single DynamicFrame out of the returned collection
    SelectFromCollection_node = SelectFromCollection.apply(
        dfc=Customtransform_node,
        key="CustomTransform0",
        transformation_ctx="SelectFromCollection_node",
    )
    
    # Script generated for node Redshift Cluster
    RedshiftCluster_node3 = glueContext.write_dynamic_frame.from_catalog(
        frame=SelectFromCollection_node,
        database="demodb",
        table_name="demodb_public_pii_table",
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="RedshiftCluster_node3",
    )
    
    job.commit()

  3. Save the script with the file name pii-data-encryption.py.
    Save file in Cloud9
  4. Copy the script to the desired S3 bucket location by running the following commands:
    S3_BUCKET=$(aws s3 ls| awk '{print $3}'| grep awsblog-pii-data-input-)
    aws s3 cp pii-data-encryption.py s3://$S3_BUCKET/glue-script/pii-data-encryption.py

    Upload AWS Glue script to S3

  5. To verify the script is uploaded successfully, navigate to the Jobs page on the AWS Glue console. You should be able to find a job named pii-data-encryption-job.
    AWS Glue console
  6. Choose Run to trigger the AWS Glue job. It will first read the source data from the S3 bucket registered in the AWS Glue Data Catalog, then apply column mappings to transform the data into the expected data types, perform encryption of the PII fields, and finally load the encrypted data into the target Redshift table. The whole process should be completed within 5 minutes for this sample dataset. You can switch to the Runs tab to monitor the job status.
    Monitor AWS Glue job
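
If you prefer working from the AWS Cloud9 terminal, the following commands are one way to trigger the job and poll the state of its most recent run instead of using the console:

aws glue start-job-run --job-name pii-data-encryption-job
aws glue get-job-runs --job-name pii-data-encryption-job --query "JobRuns[0].JobRunState"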

Configure a Lambda function to perform data decryption

A Lambda function with the data decryption logic is deployed for you during the CloudFormation stack setup. You can find the function on the Lambda console.

AWS Lambda console

The following is the Python code used in the Lambda function:

import boto3
import os
import json
import base64
import logging
from miscreant.aes.siv import SIV

logger = logging.getLogger()
logger.setLevel(logging.INFO)

secret_name = os.environ['DATA_ENCRYPT_KEY']

sm_client = boto3.client('secretsmanager')
get_secret_value_response = sm_client.get_secret_value(SecretId = secret_name)
data_encryption_key = get_secret_value_response['SecretBinary']

siv = SIV(data_encryption_key)  # Without nonce, the encryption becomes deterministic

# define the Lambda function handler logic
def lambda_handler(event, context):
    ret = dict()
    res = []
    for argument in event['arguments']:
        encrypted_value = argument[0]
        try:
            de_val = siv.open(base64.b64decode(encrypted_value)) # perform decryption
        except:
            # fall back to the original value if decryption fails
            de_val = encrypted_value.encode('utf-8') if encrypted_value else b''
            logger.warning('Decryption for value failed: ' + str(encrypted_value))
        res.append(json.dumps(de_val.decode('utf-8')))

    ret['success'] = True
    ret['results'] = res

    return json.dumps(ret) # return decrypted results

If you want to deploy the Lambda function on your own, make sure to include the Miscreant package in your deployment package.
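
To sanity-check the decryption logic locally before wiring it up to Amazon Redshift, you can invoke the handler with a synthetic payload that mimics the batched request format Amazon Redshift sends to Lambda UDFs, where arguments is a list of rows and each row is a list of argument values. The sketch below assumes the handler code is saved as lambda_function.py and that AWS credentials and the DATA_ENCRYPT_KEY environment variable are set in your environment:

import base64

import lambda_function  # module name is an assumption; adjust to your handler file

# build a ciphertext the same way the AWS Glue job does, reusing the module's SIV instance
ciphertext = base64.b64encode(lambda_function.siv.seal("4111111111111111".encode())).decode('utf-8')

# "arguments" mimics one batch of rows sent by Amazon Redshift
test_event = {"arguments": [[ciphertext]]}

# should print a JSON document with "success": true and the decrypted value in "results"
print(lambda_function.lambda_handler(test_event, None))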

Register a Lambda UDF in Amazon Redshift

You can create Lambda UDFs that use custom functions defined in Lambda as part of your SQL queries. Lambda UDFs are managed in Lambda, and you can control the access privileges to invoke these UDFs in Amazon Redshift.

  1. Navigate back to the Amazon Redshift Query Editor V2 to register the Lambda UDF.
  2. Use the CREATE EXTERNAL FUNCTION command and provide an IAM role that the Amazon Redshift cluster is authorized to assume in order to call Lambda:
    CREATE OR REPLACE EXTERNAL FUNCTION pii_decrypt (value varchar(max))
    RETURNS varchar STABLE
    LAMBDA '<--Replace-with-your-lambda-function-name-->'
    IAM_ROLE '<--Replace-with-your-redshift-lambda-iam-role-arn-->';

You can find the Lambda function name and Amazon Redshift IAM role ARN on the CloudFormation stack Outputs tab:

  • LambdaFunctionName
  • IAMRoleForRedshiftLambdaUDF

CloudFormation stack output
Create External Function in Amazon Redshift

Validate the column-level encryption functionality in Amazon Redshift

By default, permission to run new Lambda UDFs is granted to PUBLIC. To restrict usage of the newly created UDF, revoke the permission from PUBLIC and then grant the privilege to specific users or groups. To learn more about Lambda UDF security and privileges, see Managing Lambda UDF security and privileges.

You must be a superuser or have the sys:secadmin role to run the following SQL statements:

GRANT SELECT ON "demodb"."public"."pii_table" TO PUBLIC;
CREATE USER regular_user WITH PASSWORD '1234Test!';
CREATE USER privileged_user WITH PASSWORD '1234Test!';
REVOKE EXECUTE ON FUNCTION pii_decrypt(varchar) FROM PUBLIC;
GRANT EXECUTE ON FUNCTION pii_decrypt(varchar) TO privileged_user;

First, we run a SELECT statement to verify that our highly sensitive data field, in this case the registered_credit_card column, is now encrypted in the Amazon Redshift table:

SELECT * FROM "demodb"."public"."pii_table";

Select statement

Regular database users who have not been granted the permission to use the Lambda UDF will see a permission denied error when they try to use the pii_decrypt() function:

SET SESSION AUTHORIZATION regular_user;
SELECT *, pii_decrypt(registered_credit_card) AS decrypted_credit_card FROM "demodb"."public"."pii_table";

Permission denied

Privileged database users who have been granted the permission to use the Lambda UDF for decrypting the data can issue a SQL statement using the pii_decrypt() function:

SET SESSION AUTHORIZATION privileged_user;
SELECT *, pii_decrypt(registered_credit_card) AS decrypted_credit_card FROM "demodb"."public"."pii_table";

The original registered_credit_card values can be successfully retrieved, as shown in the decrypted_credit_card column.

Decrypted results
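
After testing, you can optionally switch back from the test users to your original admin session before running further administrative statements:

RESET SESSION AUTHORIZATION;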

Clean up

To avoid incurring future charges, make sure to clean up all the AWS resources that you created as part of this post.

You can delete the CloudFormation stack on the AWS CloudFormation console or via the AWS Command Line Interface (AWS CLI). The default stack name is aws-blog-redshift-column-level-encryption.
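
For example, the following AWS CLI command deletes the stack, assuming the default stack name and the us-east-2 Region used earlier:

aws cloudformation delete-stack \
  --stack-name aws-blog-redshift-column-level-encryption \
  --region us-east-2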

Conclusion

In this post, we demonstrated how to implement a custom column-level encryption solution for Amazon Redshift, which provides an additional layer of protection for sensitive data stored on the cloud data warehouse. The CloudFormation template gives you an easy way to set up the data pipeline, which you can further customize for your specific business scenarios. You can also modify the AWS Glue ETL code to encrypt multiple data fields at the same time, and to use different data encryption keys for different columns for enhanced data security. With this solution, you can limit the occasions where human actors can access sensitive data stored in plain text on the data warehouse.

You can learn more about this solution and the source code by visiting the GitHub repository. To learn more about how to use Amazon Redshift UDFs to solve different business problems, refer to Example uses of user-defined functions (UDFs) and Amazon Redshift UDFs.


About the Author

Aaron Chong is an Enterprise Solutions Architect at Amazon Web Services Hong Kong. He specializes in the data analytics space, and works with a wide range of customers to build big data analytics platforms, modernize data engineering practices, and advocate AI/ML democratization.
