Saturday, September 24, 2022

Design considerations for Amazon EMR on EKS in a multi-tenant Amazon EKS environment

Many AWS customers use Amazon Elastic Kubernetes Service (Amazon EKS) to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also allows you to improve resource utilization, reduce cost, and simplify infrastructure management to support different application deployments. This model is beneficial for those running Apache Spark workloads, for several reasons. For example, it allows you to have multiple Spark environments running concurrently with different configurations and dependencies that are segregated from each other through Kubernetes multi-tenancy features. In addition, the same cluster can be used for various workloads like machine learning (ML), host applications, and data streaming, thereby reducing the operational overhead of managing multiple clusters.

AWS offers Amazon EMR on EKS, a managed service that enables you to run your Apache Spark workloads on Amazon EKS. This service uses the Amazon EMR runtime for Apache Spark, which increases the performance of your Spark jobs so that they run faster and cost less. When you run Spark jobs on EMR on EKS instead of self-managed Apache Spark on Kubernetes, you can take advantage of automated provisioning, scaling, faster runtimes, and the development and debugging tools that Amazon EMR provides.

In this post, we show how to configure and run EMR on EKS in a multi-tenant EKS cluster that can be used by your various teams. We address multi-tenancy through four topics: network, resource management, cost management, and security.


Throughout this post, we use terminology that is specific to EMR on EKS, Spark, or Kubernetes:

  • Multi-tenancy – Multi-tenancy in Kubernetes can come in three forms: hard multi-tenancy, soft multi-tenancy, and sole multi-tenancy. Hard multi-tenancy means each business unit or group of applications gets a dedicated Kubernetes cluster; there is no sharing of the control plane. This model is out of scope for this post. Soft multi-tenancy is where pods might share the same underlying compute resource (node) and are logically separated using Kubernetes constructs through namespaces, resource quotas, or network policies. A second way to achieve multi-tenancy in Kubernetes is to assign pods to specific nodes that are pre-provisioned and allocated to a specific team. In this case, we talk about sole multi-tenancy. Unless your security posture requires you to use hard or sole multi-tenancy, you will want to consider using soft multi-tenancy for the following reasons:
    • Soft multi-tenancy avoids underutilization and waste of compute resources.
    • There is a limited number of managed node groups that can be used by Amazon EKS, so for large deployments, this limit can quickly become a limiting factor.
    • In sole multi-tenancy, there is a high chance of ghost nodes with no pods scheduled on them due to misconfiguration, because we force pods onto dedicated nodes with labels, taints and tolerations, and anti-affinity rules.
  • Namespace – Namespaces are core in Kubernetes and a pillar to implement soft multi-tenancy. With namespaces, you can divide the cluster into logical partitions. These partitions are then referenced in quotas, network policies, service accounts, and other constructs that help isolate environments in Kubernetes.
  • Virtual cluster – An EMR virtual cluster is mapped to a Kubernetes namespace that Amazon EMR is registered with. Amazon EMR uses virtual clusters to run jobs and host endpoints. Multiple virtual clusters can be backed by the same physical cluster. However, each virtual cluster maps to one namespace on an EKS cluster. Virtual clusters don't create any active resources that contribute to your bill or require lifecycle management outside the service.
  • Pod template – In EMR on EKS, you can provide a pod template to control pod placement, or define a sidecar container. This pod template can be defined for executor pods and driver pods, and stored in an Amazon Simple Storage Service (Amazon S3) bucket. The S3 locations are then submitted as part of the applicationConfiguration object that is part of configurationOverrides, as defined in the EMR on EKS job submission API.
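As an illustrative sketch (the bucket name and template file names below are placeholder assumptions, not values from this post), the S3-hosted pod templates might be referenced in a StartJobRun request like this:

```json
{
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.kubernetes.driver.podTemplateFile": "s3://<your-bucket>/driver-pod-template.yaml",
          "spark.kubernetes.executor.podTemplateFile": "s3://<your-bucket>/executor-pod-template.yaml"
        }
      }
    ]
  }
}
```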

Security considerations

In this section, we address security from different angles. We first discuss how to protect the IAM role that is used for running the job. Then we address how to protect secrets used in jobs, and finally we discuss how you can protect data while it is processed by Spark.

IAM role protection

A job submitted to EMR on EKS needs an AWS Identity and Access Management (IAM) execution role to interact with AWS resources, for example with Amazon S3 to get data, with Amazon CloudWatch Logs to publish logs, or to use an encryption key in AWS Key Management Service (AWS KMS). It's a best practice in AWS to apply least privilege for IAM roles. In Amazon EKS, this is achieved through IRSA (IAM Roles for Service Accounts). This mechanism allows a pod to assume an IAM role at the pod level rather than at the node level, while using short-lived credentials that are provided through the EKS OIDC provider.

IRSA creates a trust relationship between the EKS OIDC provider and the IAM role. This method allows only pods with a service account (annotated with an IAM role ARN) to assume a role that has a trust policy with the EKS OIDC provider. However, this isn't enough, because it would allow any pod with a service account within the EKS cluster that is annotated with a role ARN to assume the execution role. This must be further scoped down using conditions on the role trust policy. The condition allows the assume role to happen only if the calling service account is the one used for running a job associated with the virtual cluster. The following code shows the structure of the condition to add to the trust policy:

    "Model": "2012-10-17",
    "Assertion": [
            "Effect": "Allow",
            "Principal": {
                "Federated": <OIDC provider ARN >
            "Action": "sts:AssumeRoleWithWebIdentity"
            "Condition": { "StringLike": { “<OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:emr-containers-sa-*-*-<AWS_ACCOUNT_ID>-<BASE36_ENCODED_ROLE_NAME>”} }

To scope down the trust policy using the service account condition, you need to run the following command with the AWS CLI:

aws emr-containers update-role-trust-policy \
  --cluster-name cluster \
  --namespace namespace \
  --role-name iam_role_name_for_job_execution

The command adds the service account that will be used by the Spark client, Jupyter Enterprise Gateway, Spark kernel, driver, or executor. The service account names have the following structure: emr-containers-sa-*-*-<AWS_ACCOUNT_ID>-<BASE36_ENCODED_ROLE_NAME>.

In addition to the role segregation provided by IRSA, we recommend blocking access to instance metadata, because a pod can still inherit the rights of the instance profile assigned to the worker node. For more information about how you can block access to metadata, refer to Restrict access to the instance profile assigned to the worker node.

Secret protection

Sometimes a Spark job needs to consume data stored in a database or from APIs. Most of the time, these are protected with a password or access key. The most common way to pass these secrets is through environment variables. However, in a multi-tenant environment, this means any user with access to the Kubernetes API can potentially access the secrets in the environment variables if this access isn't scoped well to the namespaces the user has access to.

To overcome this challenge, we recommend using a secrets store like AWS Secrets Manager that can be mounted through the Secrets Store CSI Driver. The benefit of using Secrets Manager is the ability to use IRSA and allow only the role assumed by the pod access to the given secret, thereby improving your security posture. You can refer to the best practices guide for sample code showing the use of Secrets Manager with EMR on EKS.
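As a minimal sketch (the resource name and secret name are placeholder assumptions), a SecretProviderClass that exposes a Secrets Manager secret through the Secrets Store CSI Driver could look like the following; the pod then mounts it as a CSI volume, and with IRSA only the job execution role can retrieve the secret:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: spark-db-credentials          # placeholder name
  namespace: <EMR-VC-NAMESPACE>
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "prod/db-password"   # placeholder Secrets Manager secret name
        objectType: "secretsmanager"
```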

Spark data encryption

When a Spark application is running, the driver and executors produce intermediate data. This data is written to the node local storage. Anyone who is able to exec into the pods would be able to read this data. Spark supports encryption of this data, and it can be enabled by passing --conf spark.io.encryption.enabled=true. Because this configuration adds a performance penalty, we recommend enabling data encryption only for workloads that store and access highly sensitive data and in untrusted environments.
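For illustration, a sketch of passing the local storage encryption setting (together with Spark authentication, which it relies on) through sparkSubmitParameters in a StartJobRun request; the entry point path is a placeholder assumption:

```json
{
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://<your-bucket>/job.py",
      "sparkSubmitParameters": "--conf spark.authenticate=true --conf spark.io.encryption.enabled=true"
    }
  }
}
```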

Network considerations

In this section, we discuss how to manage networking within the cluster as well as outside the cluster. We first address how Spark handles communication between the driver and executors, and how to secure it. Then we discuss how to restrict network traffic between pods in the EKS cluster and allow only traffic destined to EMR on EKS. Last, we discuss how to restrict traffic from executor and driver pods to external AWS services using security groups.

Network encryption

The communication between the driver and executors uses the RPC protocol and is not encrypted. Starting with Spark 3 in a Kubernetes-backed cluster, Spark offers a mechanism to encrypt communication using AES encryption.

The driver generates a key and shares it with executors through an environment variable. Because the key is shared through the environment variable, potentially any user with access to the Kubernetes API (kubectl) can read the key. We recommend securing access so that only authorized users have access to the EMR virtual cluster. In addition, you should set up Kubernetes role-based access control in such a way that access to the pod spec in the namespace where the EMR virtual cluster runs is granted to only a few chosen service accounts. This method of passing secrets through the environment variable may change in the future, with a proposal to use Kubernetes secrets.

To enable encryption, RPC authentication must also be enabled in your Spark configuration. To enable encryption in transit in Spark, you should use the following parameters in your Spark config:

--conf spark.authenticate=true
--conf spark.network.crypto.enabled=true


Note that these are the minimal parameters to set; refer to Encryption for the complete list of parameters.

Additionally, applying encryption in Spark has a negative impact on processing speed. You should only apply it when there is a compliance or regulatory need.

Securing network traffic within the cluster

In Kubernetes, by default pods can communicate over the network across different namespaces in the same cluster. This behavior isn't always desirable in a multi-tenant environment. In some instances, for example in regulated industries, to be compliant you must enforce strict control over the network and send and receive traffic only from the namespace that you're interacting with. For EMR on EKS, that would be the namespace associated with the EMR virtual cluster. Kubernetes offers constructs that allow you to implement network policies and define fine-grained control over pod-to-pod communication. These policies are implemented by the CNI plugin; in Amazon EKS, the default plugin is the VPC CNI. A policy is defined as follows and is applied with kubectl:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-np-ns1
  namespace: <EMR-VC-NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          nsname: <EMR-VC-NAMESPACE>
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          nsname: <EMR-VC-NAMESPACE>

Network traffic outside the cluster

In Amazon EKS, when you deploy pods on Amazon Elastic Compute Cloud (Amazon EC2) instances, all the pods use the security group associated with the node. This can be an issue if your pods (executor pods) are accessing a data source (namely a database) that allows traffic based on the source security group. Database servers often restrict network access to only where they are expecting it. In the case of a multi-tenant EKS cluster, this means pods from other teams that shouldn't have access to the database servers would be able to send traffic to them.

To overcome this challenge, you can use security groups for pods. This feature allows you to assign a specific security group to your pods, thereby controlling the network traffic to your database server or data source. You can also refer to the best practices guide for a reference implementation.
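A minimal sketch of a SecurityGroupPolicy (the name, pod label, and security group ID are placeholder assumptions) that attaches a dedicated security group to one team's Spark pods:

```yaml
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: teama-db-access               # placeholder name
  namespace: <EMR-VC-NAMESPACE>
spec:
  podSelector:
    matchLabels:
      team: teamA                     # placeholder label, set via the pod template
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0          # placeholder security group allowed by the database
```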

Cost management and chargeback

In a multi-tenant environment, cost management is a critical subject. You have multiple users from various business units, and you need to be able to precisely charge back the cost of the compute resources they have used. At the beginning of the post, we introduced three models of multi-tenancy in Amazon EKS: hard multi-tenancy, soft multi-tenancy, and sole multi-tenancy. Hard multi-tenancy is out of scope because the cost tracking is trivial; all the resources are dedicated to the team using the cluster, which isn't the case for sole multi-tenancy and soft multi-tenancy. In the next sections, we discuss how to track the cost for each of these two models.

Soft multi-tenancy

In a soft multi-tenant environment, you can perform chargeback to your data engineering teams based on the resources they consumed rather than the nodes allocated. In this method, you use the namespaces associated with the EMR virtual cluster to track how many resources were used for processing jobs. The following diagram illustrates an example.

Diagram 1: Soft multi-tenancy

Tracking resources based on the namespace isn't an easy task, because jobs are transient in nature and fluctuate in their duration. However, there are partner tools available that allow you to keep track of the resources used, such as Kubecost, CloudZero, Vantage, and many others. For instructions on using Kubecost on Amazon EKS, refer to this blog post on cost monitoring for EKS customers.

Sole multi-tenancy

For sole multi-tenancy, the chargeback is done at the instance (node) level. Each team uses a specific set of nodes that are dedicated to it. These nodes aren't always running, and are spun up using the Kubernetes auto scaling mechanism. The following diagram illustrates an example.


Diagram 2: Sole tenancy

With sole multi-tenancy, you use a cost allocation tag, which is an AWS mechanism that allows you to track how much each resource has consumed. Although the method of sole multi-tenancy isn't efficient in terms of resource utilization, it provides a simplified strategy for chargebacks. With the cost allocation tag, you can charge back a team based on all the resources they used, like Amazon S3, Amazon DynamoDB, and other AWS resources. The chargeback mechanism based on the cost allocation tag can be augmented using the recently launched AWS Billing Conductor, which allows you to issue bills internally for your teams.
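As a sketch (the cluster name, Region, node group name, and tag key are assumptions for illustration), tags can be applied when defining a team's dedicated managed node group, for example in an eksctl configuration, and then activated as cost allocation tags in the billing console:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: multi-tenant-cluster      # placeholder cluster name
  region: us-east-1               # placeholder Region
managedNodeGroups:
  - name: teama-nodes             # placeholder node group dedicated to one team
    instanceType: m5.xlarge
    minSize: 0
    maxSize: 10
    desiredCapacity: 0
    tags:
      team: teamA                 # cost allocation tag used for chargeback
```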

Resource management

In this section, we discuss considerations regarding resource management in multi-tenant clusters. We briefly cover topics like sharing resources graciously, setting guard rails on resource consumption, techniques for ensuring resources for time-sensitive and/or critical jobs, meeting quick resource scaling requirements, and finally cost optimization practices with node selectors.

Sharing resources

In a multi-tenant environment, the goal is to share resources like compute and memory for better resource utilization. However, this requires careful capacity management and resource allocation to make sure each tenant gets their fair share. In Kubernetes, resource allocation is controlled and enforced by using ResourceQuota and LimitRange. ResourceQuota limits resources at the namespace level, and LimitRange allows you to make sure that all the containers are submitted with a resource requirement and a limit. In this section, we demonstrate how a data engineer or Kubernetes administrator can set up a ResourceQuota and LimitRange configuration.

The administrator creates one ResourceQuota per namespace that provides constraints for aggregate resource consumption:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: teamA
spec:
  hard:
    requests.cpu: "1000"
    requests.memory: 4000Gi
    limits.cpu: "2000"
    limits.memory: 6000Gi

For LimitRange, the administrator can review the following sample configuration. We recommend using default and defaultRequest to enforce the limit and request fields on containers. Lastly, from a data engineer perspective, while submitting EMR on EKS jobs, you need to make sure the Spark resource requirement parameters are within the range of the defined LimitRange. For example, in the following configuration, a request for spark.executor.cores=7 will fail because the max limit for CPU is 6 per container:

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-min-max
  namespace: teamA
spec:
  limits:
  - max:
      cpu: "6"
    min:
      cpu: "100m"
    default:
      cpu: "500m"
    defaultRequest:
      cpu: "100m"
    type: Container

Priority-based resource allocation


Diagram 3: An example of resource allocation with priority

Because all the EMR virtual clusters share the same EKS compute platform with limited resources, there will be scenarios in which you need to prioritize jobs on a sensitive timeline. In this case, high-priority jobs can utilize the resources and finish, whereas running low-priority jobs get stopped and any new pods must wait in the queue. EMR on EKS can achieve this with the help of pod templates, where you specify a priority class for the given job.

When pod priority is enabled, the Kubernetes scheduler orders pending pods by their priority and places them in the scheduling queue. As a result, a higher-priority pod may be scheduled earlier than pods with lower priority if its scheduling requirements are met. If this pod can't be scheduled, the scheduler continues and tries to schedule other lower-priority pods.

The preemptionPolicy field on the PriorityClass defaults to PreemptLowerPriority, and the pods of that PriorityClass can preempt lower-priority pods. If preemptionPolicy is set to Never, pods of that PriorityClass are non-preempting. In other words, they can't preempt any other pods. When lower-priority pods are preempted, the victim pods get a grace period to finish their work and exit. If a pod doesn't exit within that grace period, it is stopped by the Kubernetes scheduler. Therefore, there is usually a time gap between the point when the scheduler preempts victim pods and the time that a higher-priority pod is scheduled. If you want to minimize this gap, you can set the deletion grace period of lower-priority pods to zero or a small number. You can do this by setting the terminationGracePeriodSeconds option in the victim pod YAML.
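As a hedged illustration, a low-priority executor pod template fragment could set a short grace period so that preemption frees capacity quickly:

```yaml
apiVersion: v1
kind: Pod
spec:
  priorityClassName: "low-priority"
  terminationGracePeriodSeconds: 0   # preempted executors exit immediately
  containers:
  - name: spark-kubernetes-executors
```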

See the following code samples for priority classes:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100
globalDefault: false
description: "High-priority pods and driver pods."

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 50
globalDefault: false
description: "Low-priority pods."

One of the key considerations while templatizing the driver pods, especially for low-priority jobs, is to avoid using the same low-priority class for both driver and executors. This saves the driver pods from getting evicted and losing the progress of all their executors in a resource congestion scenario. In this low-priority job example, we have used a high-priority class for the driver pod template and the low-priority class only for executor templates. This way, we can make sure the driver pods are protected during the eviction process of low-priority jobs. In this case, only executors will be evicted, and the driver can bring back the evicted executor pods as resources become free. See the following code:

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "high-priority"
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND
  containers:
  - name: spark-kubernetes-driver # This will be interpreted as the Spark driver container

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "low-priority"
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
  containers:
  - name: spark-kubernetes-executors # This will be interpreted as the Spark executor container

Overprovisioning with priority


Diagram 4: An example of overprovisioning with priority

As pods wait in a pending state due to resource availability, additional capacity can be added to the cluster with Amazon EKS auto scaling. The time it takes to scale the cluster by adding new nodes has to be considered for time-sensitive jobs. Overprovisioning is an option to mitigate the auto scaling delay using temporary pods with negative priority. These pods occupy space in the cluster. When pods with high priority are unschedulable, the temporary pods are preempted to make room. This causes the auto scaler to scale out new nodes due to overprovisioning. Be aware that this is a trade-off, because it adds higher cost while minimizing scheduling latency. For more information about overprovisioning best practices, refer to Overprovisioning.
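A common sketch of this pattern (the names, replica count, and resource sizes are illustrative assumptions) pairs a negative-priority class with a deployment of pause pods that merely reserve capacity:

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                            # negative priority: always preempted first
globalDefault: false
description: "Placeholder pods for overprovisioning."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2                        # amount of headroom to keep warm
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: reserve-resources
        image: registry.k8s.io/pause
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
```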

Node selectors

EKS clusters can span multiple Availability Zones in a VPC. A Spark application whose driver and executor pods are distributed across multiple Availability Zones can incur inter-Availability Zone data transfer costs. To minimize or eliminate the data transfer cost, you should configure the job to run in a specific Availability Zone, or even on a specific node type, with the help of node labels. Amazon EKS places a set of default labels to identify capacity type (On-Demand or Spot Instance), Availability Zone, instance type, and more. In addition, we can use custom labels to meet workload-specific node affinity.

EMR on EKS allows you to choose specific nodes in two ways:

  • At the job level. Refer to EKS Node Placement for more details.
  • At the driver and executor level using pod templates.

When using pod templates, we recommend using On-Demand Instances for driver pods. You can also consider including Spot Instances for executor pods for workloads that are tolerant of occasional periods when the target capacity isn't completely available. Leveraging Spot Instances allows you to save cost for jobs that aren't critical and can be terminated. Please refer to Define a NodeSelector in PodTemplates.
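For example, a driver pod template that pins the driver to a single Availability Zone and On-Demand capacity using the default EKS labels could look like this sketch (the zone value is a placeholder):

```yaml
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    topology.kubernetes.io/zone: us-east-1a       # placeholder Availability Zone
    eks.amazonaws.com/capacityType: ON_DEMAND
  containers:
  - name: spark-kubernetes-driver
```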


Conclusion

In this post, we provided guidance on how to design and deploy EMR on EKS in a multi-tenant EKS environment through different lenses: network, security, cost management, and resource management. For any deployment, we recommend the following:

  • Use IRSA with a condition scoped on the EMR on EKS service account
  • Use a secrets manager to store credentials and the Secrets Store CSI Driver to access them in your Spark application
  • Use ResourceQuota and LimitRange to specify the resources that each of your data engineering teams can use and avoid compute resource abuse and starvation
  • Implement a network policy to segregate network traffic between pods

Finally, if you are considering migrating your Spark workload to EMR on EKS, you can further learn about design patterns to manage Apache Spark workloads on EMR on EKS in this blog, and about migrating your EMR transient cluster to EMR on EKS in this blog.

About the Authors

Lotfi Mouhib is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

Ajeeb Peter is a Senior Solutions Architect with Amazon Web Services based in Charlotte, North Carolina, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 20 years of technology experience in Software Development, Architecture and Analytics from industries like finance and telecom.


