Add exostellar infrastructure optimizer playbook
Configure XIO

Resolves #226
cartalla committed Oct 8, 2024
1 parent 32aa3c3 commit 543df60
Showing 30 changed files with 1,534 additions and 8 deletions.
9 changes: 7 additions & 2 deletions docs/deployment-prerequisites.md
@@ -6,7 +6,7 @@ This page shows common prerequisites that need to be completed before deployment.

The deployment process was developed and tested using Amazon Linux 2.
It has also been tested on RHEL 8 and RHEL 9.
-An easy way to create a deployment instance is to use an AWS Cloud 9 desktop.
+An easy way to create a deployment instance is to use AWS CloudShell.
This will give you a code editor and shell environment that you can use to deploy the cluster.

If the required packages aren't installed, then you will need sudo or root access on the instance.
@@ -20,7 +20,7 @@ You will need AWS credentials that provide admin access to deploy the cluster.
Clone or download the aws-eda-slurm-cluster repository to your system.

```
-git clone git@github.com:aws-samples/aws-eda-slurm-cluster.git
+git clone https://github.com/aws-samples/aws-eda-slurm-cluster.git
```

## Create SNS Topic for Error Notifications (Optional but recommended)
@@ -229,6 +229,11 @@ when you create the file system.
If the file system already exists, then attach the appropriate security group to its network interfaces.
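
This step can also be scripted. Below is a minimal boto3 sketch, assuming you know the IDs of the file system's network interfaces (for example, its mount target ENIs) and of the security group to attach; both IDs are placeholders:

```
import boto3

# Placeholders: a network interface of the file system and the
# security group to attach to it.
eni_id = 'eni-0123456789abcdef0'
extra_sg_id = 'sg-0123456789abcdef0'

ec2 = boto3.client('ec2')

# Groups replaces the ENI's entire security group list, so read the
# current groups first and append the new one.
eni = ec2.describe_network_interfaces(NetworkInterfaceIds=[eni_id])['NetworkInterfaces'][0]
group_ids = [group['GroupId'] for group in eni['Groups']]
if extra_sg_id not in group_ids:
    ec2.modify_network_interface_attribute(
        NetworkInterfaceId=eni_id,
        Groups=group_ids + [extra_sg_id],
    )
```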

## Create Exostellar Management Server

If you're going to use Exostellar Infrastructure Optimizer (XIO), then you will need to deploy the Exostellar management server.
See the [XIO page](../exostellar-infrastructure-optimizer) for details.

## Create Configuration File

Before you deploy a cluster you need to create a configuration file.
33 changes: 33 additions & 0 deletions docs/exostellar-infrastructure-optimizer.md
@@ -0,0 +1,33 @@
# Exostellar Infrastructure Optimizer

[Exostellar Infrastructure Optimizer](https://exostellar.io/infrastructureoptimizer-technical-information/) (XIO) runs applications in virtual machines (VMs) on EC2 instances and dynamically relocates the VMs between instances based on availability and cost.
Long-running, stateful jobs normally can't run on spot instances because they can't be restarted after a spot termination.
XIO mitigates this by predicting spot terminations and dynamically relocating the VM to an on-demand instance before the spot instance is reclaimed.
When spot capacity becomes available again, the VM can be migrated back to a spot instance.
This allows you to save up to 90% over on-demand pricing by running on spot when capacity is available.

## XIO Configuration

Refer to [Exostellar's documentation](https://docs.exostellar.io/latest/Latest/HPC-User/getting-started-installation) to make sure you have the latest instructions.

### Create IAM permissions stack

[Create the EC2 instance profiles](https://docs.exostellar.io/latest/Latest/HPC-User/getting-ready-prerequisites#GettingReady:Prerequisites-EC2InstanceProfiles).

* Download the CloudFormation template
* Create a stack using the template
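
These two steps can also be scripted. Below is a minimal boto3 sketch; the template file name, stack name, and region are placeholders for the template you downloaded:

```
import boto3

cfn = boto3.client('cloudformation', region_name='us-east-1')  # placeholder region

# Placeholder file name for the downloaded Exostellar template.
with open('exostellar-iam.yaml') as f:
    template_body = f.read()

# The template creates IAM resources, so the IAM capabilities are required.
cfn.create_stack(
    StackName='exostellar-iam',  # placeholder stack name
    TemplateBody=template_body,
    Capabilities=['CAPABILITY_IAM', 'CAPABILITY_NAMED_IAM'],
)
cfn.get_waiter('stack_create_complete').wait(StackName='exostellar-iam')
```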

### Install the Management Server

[Install the management server](https://docs.exostellar.io/latest/Latest/HPC-User/installing-management-server).

For the shared security group ID, use the SlurmLoginNodeSGId so that the management server has access to the Slurm head node.

### Configure Slurm

```
export MGMT_SERVER=10.4.130.5
export SLURM_CONF_DIR=/opt/slurm/res-eda-pc-3-10-1-rhel8-x86/etc
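# Base64-encoded cloud-init user data for the XIO worker VMs (see the decoding sketch after this block):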
"I2Nsb3VkLWNvbmZpZwpydW5jbWQ6CiAgLSBbc2gsIC1jLCAibWtkaXIgLXAgL3hjb21wdXRlIl0KICAtIFtzaCwgLWMsICJtb3VudCAxNzIuMzEuMjQuNToveGNvbXB1dGUgL3hjb21wdXRlIl0KICAtIFtzaCwgLWMsICJta2RpciAtcCAvaG9tZS9zbHVybSJdCiAgLSBbc2gsIC1jLCAibW91bnQgMTcyLjMxLjI0LjU6L2hvbWUvc2x1cm0gL2hvbWUvc2x1cm0iXQogIC0gW3NoLCAtYywgInJtIC1yZiAvZXRjL3NsdXJtIl0KICAtIFtzaCwgLWMsICJsbiAtcyAveGNvbXB1dGUvc2x1cm0vIC9ldGMvc2x1cm0iXQogIC0gW3NoLCAtYywgImNwIC94Y29tcHV0ZS9zbHVybS9tdW5nZS5rZXkgL2V0Yy9tdW5nZS9tdW5nZS5rZXkiXQogIC0gW3NoLCAtYywgInN5c3RlbWN0bCByZXN0YXJ0IG11bmdlIl0KICAjIEFMV0FZUyBMQVNUIQogIC0gWwogICAgICBzaCwKICAgICAgLWMsCiAgICAgICJlY2hvIFhTUE9UX05PREVOQU1FID4gL3Zhci9ydW4vbm9kZW5hbWU7IHNjb250cm9sIHVwZGF0ZSBub2RlbmFtZT1YU1BPVF9OT0RFTkFNRSBub2RlYWRkcj1gaG9zdG5hbWUgLUlgIiwKICAgIF0KCg=="
```
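
The quoted string is base64-encoded cloud-init user data: decoded, it mounts the shared `/xcompute` and `/home/slurm` file systems, links the Slurm configuration into `/etc/slurm`, installs the munge key, and updates the node's name and address in Slurm. A small Python sketch for inspecting such a blob:

```
import base64

# Paste the full string from above; truncated here for brevity.
encoded_user_data = "I2Nsb3VkLWNvbmZpZw..."

# Prints the #cloud-config document so the runcmd steps can be reviewed.
print(base64.b64decode(encoded_user_data).decode('utf-8'))
```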
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -9,6 +9,7 @@ nav:
- 'config.md'
- 'res_integration.md'
- 'soca_integration.md'
- 'exostellar-infrastructure-optimizer.md'
- 'custom-amis.md'
- 'run_jobs.md'
- 'job_preemption.md'
148 changes: 143 additions & 5 deletions source/cdk/cdk_slurm_stack.py
@@ -255,6 +255,9 @@ def check_config(self):
if 'RESStackName' in self.config:
self.update_config_for_res()

if 'Xio' in self.config['slurm']:
self.update_config_for_exostellar()

if 'ErrorSnsTopicArn' not in self.config:
logger.warning(f"ErrorSnsTopicArn not set. Provide error-sns-topic-arn on the command line or ErrorSnsTopicArn in the config file to get error notifications.")

@@ -467,6 +470,11 @@ def check_config(self):
logger.error(f"Configured SlurmCtl instance type ({slurmctl_instance_type}) has {slurmctl_memory_in_gb} GB and needs at least 4.")
config_errors += 1

if 'Xio' in self.config['slurm']:
if self.config['slurm']['ParallelClusterConfig']['Architecture'] != 'x86_64':
logger.error("Xio is only supported on x86_64 architecture, not {self.config['slurm']['ParallelClusterConfig']['Architecture']}")
config_errors += 1

if config_errors:
exit(1)

@@ -618,16 +626,16 @@ def update_config_for_res(self):
message += f"\n {stack_name:32}: status={stack_statuses[stack_name]}"
logger.error(message)
exit(1)
-res_dcv_security_group_id = None
+self.res_dcv_security_group_id = None
list_stack_resources_paginator = cloudformation_client.get_paginator('list_stack_resources')
for stack_resource_summaries in list_stack_resources_paginator.paginate(StackName=res_vdc_stack_name):
for stack_resource_summary_dict in stack_resource_summaries['StackResourceSummaries']:
if stack_resource_summary_dict['LogicalResourceId'].startswith('vdcdcvhostsecuritygroup'):
-res_dcv_security_group_id = stack_resource_summary_dict['PhysicalResourceId']
+self.res_dcv_security_group_id = stack_resource_summary_dict['PhysicalResourceId']
break
-if res_dcv_security_group_id:
+if self.res_dcv_security_group_id:
break
-if not res_dcv_security_group_id:
+if not self.res_dcv_security_group_id:
logger.error(f"RES VDI security group not found.")
exit(1)

@@ -766,7 +774,7 @@ def update_config_for_res(self):
}
)

-res_home_mount_sg_id = res_dcv_security_group_id
+res_home_mount_sg_id = self.res_dcv_security_group_id
if 'AdditionalSecurityGroups' not in self.config['slurm']['SlurmCtl']:
self.config['slurm']['SlurmCtl']['AdditionalSecurityGroups'] = []
if res_home_mount_sg_id in self.config['slurm']['SlurmCtl']['AdditionalSecurityGroups']:
@@ -783,6 +791,123 @@
self.config['slurm']['InstanceConfig']['AdditionalSecurityGroups'].append(res_home_mount_sg_id)
logger.info(f"Added slurm/InstanceConfig/AdditionalSecurityGroups={res_home_mount_sg_id}")

def update_config_for_exostellar(self):
'''
Update the config with information from the Exostellar Management Server (EMS) stack.
Get the VPC and subnet from the stack parameters, the IAM role, instance profile,
and security group from the stack resources, and the management server's IP address
from the stack outputs.
'''
logger.info(f"Updating configuration for Exostellar")
ems_stack_name = self.config['slurm']['Xio']['ManagementServerStackName']
logger.info(f" stack: {ems_stack_name}")

# Verify that the EMS stack exists.
cloudformation_client = boto3.client('cloudformation', region_name=self.config['Region'])
ems_stack_name_found = False
stack_statuses = {}
stack_dicts = {}
list_stacks_paginator = cloudformation_client.get_paginator('list_stacks')
list_stacks_kwargs = {
'StackStatusFilter': [
'CREATE_COMPLETE',
'ROLLBACK_COMPLETE',
'UPDATE_COMPLETE',
'UPDATE_ROLLBACK_COMPLETE',
'IMPORT_COMPLETE',
'IMPORT_ROLLBACK_COMPLETE'
]
}
for list_stacks_response in list_stacks_paginator.paginate(**list_stacks_kwargs):
for stack_dict in list_stacks_response["StackSummaries"]:
stack_name = stack_dict['StackName']
if stack_name == ems_stack_name:
ems_stack_name_found = True
# Don't break here so that all of the stack names are collected for error reporting
stack_status = stack_dict['StackStatus']
stack_statuses[stack_name] = stack_status
stack_dicts[stack_name] = stack_dict
if not ems_stack_name_found:
message = f"CloudFormation EMS stack named {ems_stack_name} not found. Existing stacks:"
for stack_name in sorted(stack_statuses):
message += f"\n {stack_name:32}: status={stack_statuses[stack_name]}"
logger.error(message)
exit(1)

# Get values from stack parameters
stack_parameters = cloudformation_client.describe_stacks(StackName=ems_stack_name)['Stacks'][0]['Parameters']
ems_vpc_id = None
ems_subnet_id = None
for stack_parameter_dict in stack_parameters:
if stack_parameter_dict['ParameterKey'] == 'VPCId':
ems_vpc_id = stack_parameter_dict['ParameterValue']
elif stack_parameter_dict['ParameterKey'] == 'SubnetId':
ems_subnet_id = stack_parameter_dict['ParameterValue']
if not ems_vpc_id:
logger.error(f"VPCId parameter not found in {ems_stack_name} EMS stack.")
exit(1)
if not ems_subnet_id:
logger.error(f"SubnetId parameter not found in {ems_stack_name} EMS stack.")
exit(1)
if self.config['VpcId'] != ems_vpc_id:
logger.error(f"Config file VpcId={self.config['VpcId']} is not the same as EMS {ems_stack_name} VPCId={ems_vpc_id}.")
exit(1)

# Get values from stack resources
exostellar_role = None
exostellar_instance_profile = None
exostellar_security_group = None
list_stack_resources_paginator = cloudformation_client.get_paginator('list_stack_resources')
for stack_resource_summaries in list_stack_resources_paginator.paginate(StackName=ems_stack_name):
for stack_resource_summary_dict in stack_resource_summaries['StackResourceSummaries']:
if stack_resource_summary_dict['LogicalResourceId'] == 'ExostellarRole':
exostellar_role = stack_resource_summary_dict['PhysicalResourceId']
elif stack_resource_summary_dict['LogicalResourceId'] == 'ExostellarInstanceProfile':
exostellar_instance_profile = stack_resource_summary_dict['PhysicalResourceId']
elif stack_resource_summary_dict['LogicalResourceId'] == 'ExostellarSecurityGroup':
exostellar_security_group = stack_resource_summary_dict['PhysicalResourceId']
if exostellar_role and exostellar_instance_profile and exostellar_security_group:
break
if exostellar_role and exostellar_instance_profile and exostellar_security_group:
break
if not exostellar_role:
logger.error(f"ExostellarRole resource not found in {ems_stack_name} EMS stack")
exit(1)
if not exostellar_instance_profile:
logger.error(f"ExostellarInstanceProfile resource not found in {ems_stack_name} EMS stack")
exit(1)
if not exostellar_security_group:
logger.error(f"ExostellarSecurityGroup resource not found in {ems_stack_name} EMS stack")
exit(1)
# @BUG This doesn't work because PARTITION and ACCOUNT_ID are tokens that aren't resolved here.
# self.config['slurm']['Xio']['ControllerIdentityRole'] = f"arn:{Aws.PARTITION}:iam::{Aws.ACCOUNT_ID}:instance-profile/{exostellar_instance_profile}"
# self.config['slurm']['Xio']['WorkerIdentityRole'] = f"arn:{Aws.PARTITION}:iam::{Aws.ACCOUNT_ID}:instance-profile/{exostellar_instance_profile}"
if 'ControllerSecurityGroupIds' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['ControllerSecurityGroupIds'] = []
if 'WorkerSecurityGroupIds' not in self.config['slurm']['Xio']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'] = []
if exostellar_security_group not in self.config['slurm']['Xio']['ControllerSecurityGroupIds']:
self.config['slurm']['Xio']['ControllerSecurityGroupIds'].append(exostellar_security_group)
if exostellar_security_group not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(exostellar_security_group)
if self.slurm_compute_node_sg_id:
if self.slurm_compute_node_sg_id not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(self.slurm_compute_node_sg_id)
if self.res_dcv_security_group_id:
if self.res_dcv_security_group_id not in self.config['slurm']['Xio']['WorkerSecurityGroupIds']:
self.config['slurm']['Xio']['WorkerSecurityGroupIds'].append(self.res_dcv_security_group_id)

# Get values from stack outputs
ems_ip_address = None
stack_output_dicts = cloudformation_client.describe_stacks(StackName=ems_stack_name)['Stacks'][0]['Outputs']
for stack_output_dict in stack_output_dicts:
if stack_output_dict['OutputKey'] == '2ExostellarMgmtServerPrivateIP':
ems_ip_address = stack_output_dict['OutputValue']
if not ems_ip_address:
logger.error(f"2ExostellarMgmtServerPrivateIP output not found in {ems_stack_name} EMS stack.")
exit(1)
self.config['slurm']['Xio']['ManagementServerIp'] = ems_ip_address

def create_parallel_cluster_assets(self):
# Create a secure hash of all of the assets so that changes can be easily detected to trigger cluster updates.
self.assets_hash = sha512()
@@ -966,6 +1091,7 @@ def create_parallel_cluster_assets(self):
'config/bin/on_compute_node_configured.sh',
'config/bin/submitter_configure.sh',
'config/bin/submitter_deconfigure.sh',
'config/bin/xio-compute-node-ami-configure.sh',
'config/users_groups.json',
]
self.custom_action_s3_urls = {}
@@ -1984,6 +2110,7 @@ def get_instance_template_vars(self, instance_role):
instance_template_vars['parallel_cluster_munge_version'] = get_PARALLEL_CLUSTER_MUNGE_VERSION(self.config)
instance_template_vars['parallel_cluster_python_version'] = get_PARALLEL_CLUSTER_PYTHON_VERSION(self.config)
instance_template_vars['primary_controller'] = True
instance_template_vars['slurm_uid'] = self.config['slurm']['SlurmUid']
instance_template_vars['slurmctld_port'] = self.slurmctld_port
instance_template_vars['slurmctld_port_min'] = self.slurmctld_port_min
instance_template_vars['slurmctld_port_max'] = self.slurmctld_port_max
Expand All @@ -1993,6 +2120,17 @@ def get_instance_template_vars(self, instance_role):
instance_template_vars['slurmrestd_socket_dir'] = '/opt/slurm/com'
instance_template_vars['slurmrestd_socket'] = f"{instance_template_vars['slurmrestd_socket_dir']}/slurmrestd.socket"
instance_template_vars['slurmrestd_uid'] = self.config['slurm']['SlurmCtl']['SlurmrestdUid']
if 'Xio' in self.config['slurm']:
instance_template_vars['xio_mgt_ip'] = self.config['slurm']['Xio']['ManagementServerIp']
instance_template_vars['xio_availability_zone'] = self.config['slurm']['Xio']['AvailabilityZone']
instance_template_vars['xio_controller_identity_role'] = self.config['slurm']['Xio']['ControllerIdentityRole']
instance_template_vars['xio_controller_security_group_ids'] = self.config['slurm']['Xio']['ControllerSecurityGroupIds']
instance_template_vars['subnet_id'] = self.config['SubnetId']
instance_template_vars['xio_controller_image_id'] = self.config['slurm']['Xio']['ControllerImageId']
instance_template_vars['xio_worker_identity_role'] = self.config['slurm']['Xio']['WorkerIdentityRole']
instance_template_vars['xio_worker_security_group_ids'] = self.config['slurm']['Xio']['WorkerSecurityGroupIds']
instance_template_vars['xio_worker_image_id'] = self.config['slurm']['Xio']['WorkerImageId']
instance_template_vars['xio_config'] = self.config['slurm']['Xio']
elif instance_role == 'ParallelClusterSubmitter':
instance_template_vars['slurm_version'] = get_SLURM_VERSION(self.config)
instance_template_vars['parallel_cluster_munge_version'] = get_PARALLEL_CLUSTER_MUNGE_VERSION(self.config)
39 changes: 39 additions & 0 deletions source/cdk/config_schema.py
@@ -769,6 +769,45 @@ def get_config_schema(config):
Optional('Partition', default='onprem'): str,
}
},
Optional('Xio'): {
Optional('ManagementServerStackName'): str,
Optional('ManagementServerIp'): str,
'PartitionName': str,
'Profiles': [
{
'ProfileName': str,
'NodeGroupName': str,
'MaxControllers': int,
'InstanceTypes': [
str
],
'SpotFleetTypes': [
str
],
'EnableHyperthreading': bool
}
],
'Pools': [
{
'PoolName': str,
'ProfileName': str,
'PoolSize': int,
'CPUs': int,
'ImageName': str,
'MinMemory': int,
'MaxMemory': int,
Optional('VolumeSize', default=10): int
}
],
Optional('ManagementServerImageId'): str,
Optional('AvailabilityZone'): str,
Optional('ControllerIdentityRole'): str,
Optional('ControllerSecurityGroupIds'): [ str ],
Optional('ControllerImageId'): str,
Optional('WorkerIdentityRole'): str,
Optional('WorkerSecurityGroupIds'): [ str ],
Optional('WorkerImageId'): str,
},
Optional('SlurmUid', default=401): int,
Optional('storage'): {
#
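
For reference, here is a hypothetical `Xio` section that would validate against this schema, shown as the Python dict that the schema library checks; every name, count, and instance type is illustrative only:

```
xio_example = {
    'ManagementServerStackName': 'exostellar-management-server',  # placeholder
    'PartitionName': 'xio',
    'Profiles': [
        {
            'ProfileName': 'compute',
            'NodeGroupName': 'compute-nodes',
            'MaxControllers': 10,
            'InstanceTypes': ['m5.2xlarge'],
            'SpotFleetTypes': ['m5.2xlarge'],
            'EnableHyperthreading': False,
        },
    ],
    'Pools': [
        {
            'PoolName': 'compute-pool',
            'ProfileName': 'compute',
            'PoolSize': 10,
            'CPUs': 4,
            'ImageName': 'xio-worker-image',  # placeholder image name
            'MinMemory': 8192,
            'MaxMemory': 16384,
            # 'VolumeSize' defaults to 10 when omitted
        },
    ],
}
```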
@@ -69,6 +69,7 @@ if ! [ -e $submitter_config_dir ]; then
rm -rf $PLAYBOOKS_PATH
mkdir -p $PLAYBOOKS_PATH
pushd $PLAYBOOKS_PATH
yum -y install unzip
unzip $PLAYBOOKS_ZIP_PATH
chmod -R 0700 $ANSIBLE_PATH
popd
@@ -76,6 +76,7 @@ config_scripts=(\
on_compute_node_configured.sh \
submitter_configure.sh \
submitter_deconfigure.sh \
xio-compute-node-ami-configure.sh \
)
for config_script in ${config_scripts[*]}; do
dest=$config_bin_dir/$config_script
