Reducing ML Model Training Time Using Amazon SageMaker

Hyundai Motor Company, headquartered in Seoul, South Korea, is one of the largest car manufacturers in the world. It has been investing heavily in human and material resources in the race to develop self-driving cars, also known as autonomous vehicles.

One of the algorithms often used in autonomous driving is semantic segmentation, the task of annotating every pixel of an image with an object class. These classes could be road, person, car, building, vegetation, sky, and so on. In a typical development cycle, the team at Hyundai Motor Company tests accuracy periodically and gathers additional images to correct for insufficient predictive performance in specific situations. The challenge is that there is often not enough time to prepare all the new data while still leaving sufficient time to train the model and meet the scheduled deadlines. Together with the Amazon ML Solutions Lab, Hyundai Motor Company solved this problem by significantly accelerating training using the scalable AWS Cloud and Amazon SageMaker, including the SageMaker data parallelism library.
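To make the task concrete, the following minimal sketch (with an illustrative class list and image size, not Hyundai Motor Company's actual label set) shows how a semantic segmentation label can be represented as a per-pixel map of class indices:

import torch

# Illustrative object classes; a real label set is typically larger
CLASSES = ['road', 'person', 'car', 'building', 'vegetation', 'sky']

# A label mask assigns one class index to every pixel of an H x W image
H, W = 4, 6
label_mask = torch.randint(low=0, high=len(CLASSES), size=(H, W))

print(label_mask)                         # one class index per pixel
print(CLASSES[label_mask[0, 0].item()])   # class name of pixel (0, 0)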

SageMaker is a fully managed ML platform that solves customer challenges by reducing the “heavy lifting” of managing distributed compute infrastructure and of monitoring and debugging training jobs. For this use case, we use the SageMaker data parallelism library and Amazon SageMaker Debugger to solve Hyundai Motor Company’s technical challenge and meet their business goals in a cost-effective manner.

SageMaker provides distributed training libraries for data parallelism and model parallelism. In this case, the model being trained fits into the memory of a single GPU, but the amount of training data is large, which means one epoch of training takes too long on a single GPU. This is a typical scenario in which data parallel distributed training can reduce the overall duration of the training job. SageMaker data parallelism accomplishes this by spreading the training data across multiple GPU instances and training the same model on each GPU with its assigned portion of the dataset. The SageMaker data parallelism library is designed to exploit the high-speed AWS network infrastructure, which delivers near-linear scaling efficiency as more GPUs are used.
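As a conceptual sketch of this sharding (illustrative only, not the library's internal implementation), every GPU holds an identical copy of the model but iterates over a disjoint subset of the training samples, which is the same idea PyTorch expresses with a DistributedSampler:

import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Toy dataset standing in for the real training images and labels
dataset = TensorDataset(torch.randn(1024, 3, 64, 64), torch.randint(0, 6, (1024,)))

# With 64 GPUs (8 x ml.p4d.24xlarge), each rank sees 1024 / 64 = 16 samples per epoch
world_size, rank = 64, 0   # rank 0 shown; every GPU process gets its own rank
sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

# Gradients from all ranks are averaged every step, keeping the model replicas in sync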

The training architecture uses SageMaker, optionally with Amazon FSx for Lustre for data storage, and Amazon Simple Storage Service (Amazon S3) as permanent data storage. We converted the PyTorch DataParallel-based training code to the SageMaker data parallelism library with only a few lines of code changed, and achieved up to 93% scaling efficiency with 8 GPU instances (64 GPUs in total). The following diagram depicts the AWS architecture deployed for distributed training.

Distributed training using the SageMaker data parallelism library

To use the SageMaker data parallelism library, only a small code change is needed: initialize the library and wrap the model with its DistributedDataParallel class. The following code example shows how this is done in a PyTorch training script. The API experience is similar to PyTorch’s DistributedDataParallel.


import torch
import smdistributed.dataparallel.torch.distributed as dist
from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

# Initializing distributed training process group
dist.init_process_group()

# Setting the local rank as the local GPU ID
local_rank = dist.get_local_rank()

# Wrapping the model for distributed training
model = DDP(Net())

# Pinning each training process to its own GPU and moving the model to it
torch.cuda.set_device(local_rank)
model.cuda(local_rank)
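
Beyond wrapping the model, a typical script also shards its data loader by rank and saves checkpoints only from the leader process. The following continues the script above as a sketch of that common pattern (train_dataset and batch_size are assumed to be defined elsewhere), not Hyundai Motor Company's exact code:

# Shard the dataset so each GPU trains on its own subset
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())
train_loader = torch.utils.data.DataLoader(
    train_dataset, batch_size=batch_size, sampler=train_sampler)

# Save checkpoints from the leader process only, to avoid redundant writes
if dist.get_rank() == 0:
    torch.save(model.state_dict(), '/opt/ml/model/model.pt')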

If you’re an experienced user of PyTorch or TensorFlow distributed training, you might ask, “How do I set up a cluster for distributed training, and how do I start the training processes on each instance in the cluster?” All you have to do in SageMaker is specify the number of instances and the instance type, and tell SageMaker which distributed training strategy to use. SageMaker takes care of the heavy lifting, and this configuration applies to both PyTorch and TensorFlow. See the following example code:


from sagemaker.pytorch import PyTorch
from sagemaker.inputs import FileSystemInput

# Required estimator arguments such as entry_point and role are omitted for brevity
estimator = PyTorch(instance_count=4,
                    instance_type='ml.p4d.24xlarge',
                    distribution={
                        'smdistributed':{
                            'dataparallel':{
                                'enabled': True
                            }
                        }
                    })

# Using Amazon FSx for Lustre
train_fs = FileSystemInput(file_system_id='file system id',
                           file_system_type='FSxLustre',
                           directory_path='/fsx/',
                           file_system_access_mode='ro')    
estimator.fit(inputs={'train': train_fs})

# Using S3
estimator.fit(inputs={'train': 's3://your-bucket-name/prefix/'})
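
Inside the training container, SageMaker exposes each input channel as a local directory through an environment variable, so the training script reads the data the same way whether it comes from FSx for Lustre or Amazon S3. The following is a minimal sketch (the --data-dir argument name is illustrative):

import argparse
import os

parser = argparse.ArgumentParser()
# The 'train' channel passed to estimator.fit() is exposed as SM_CHANNEL_TRAIN
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
args = parser.parse_args()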

Training performance analysis and tuning

No training code changes are required to use Debugger. You configure Debugger when you define the estimator, or enable or disable it via Studio or the Debugger API while a training job is running. We enabled Debugger system profiling of CPU utilization, GPU utilization, GPU memory utilization, and I/O wait at 500-millisecond intervals via a SageMaker estimator in order to get the full picture of the training job’s performance. This is done by defining a ProfilerConfig and passing it to the estimator:

from sagemaker.debugger import ProfilerConfig

# Collect system metrics (CPU, GPU, GPU memory, I/O wait) every 500 ms
profiler_config = ProfilerConfig(system_monitor_interval_millis=500)

estimator = PyTorch(
    ...    
    profiler_config=profiler_config)
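
If step-level detail is also needed, system profiling can be combined with Debugger's framework profiling. The following sketch (the step range is illustrative) extends the same ProfilerConfig:

from sagemaker.debugger import ProfilerConfig, FrameworkProfile

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500,
    # Capture framework metrics (data loading, forward/backward timings)
    # for a short window of steps to limit profiling overhead
    framework_profile_params=FrameworkProfile(start_step=5, num_steps=10))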

Conclusion

We detailed the challenges with this complex use case and how we used the SageMaker data parallelism library to speed up training. We also shared real-world techniques to identify bottlenecks and optimize training performance. As a result, we achieved 10 times faster training speed with just five times as many instances.

“We use computer vision models to do scene segmentation, which is important for scene understanding,” said Jinwook Choi, Senior Research Engineer at Hyundai Motor Company. “It used to take 57 minutes to train the model for one epoch, which slowed us down. Using Amazon SageMaker’s data parallelism library and with the help of Amazon ML Solutions Lab, we were able to train in 6 minutes with optimized training code on five ml.p3.16xlarge instances. With the 10 times reduction in training time, we can spend more time preparing data during the development cycle.”
