Scalable Cloud Infrastructure for Secure ML/AI Pipelines

DevSecOps Inc. helped a biomechanical ML startup implement a scalable, secure cloud infrastructure built on Amazon ECS, Amazon RDS, auto-scaling, AWS Secrets Manager, ML pipelines, Amazon Cognito, and Amazon OpenSearch, with Amazon GuardDuty and AWS Security Hub for threat detection and compliance monitoring.

Problem Statement / Definition

The biomechanical ML startup had spent many years refining its injury risk management solution, developing machine learning algorithms that help athletes and military personnel avoid injuries and recover from them quickly, significantly reducing the time to return to duty. By the time we met them, they were already serving a large number of satisfied customers. The missing component was a scalable cloud infrastructure to run their customer-facing application (Movement Health Platform, MHP) and their machine learning model training pipelines. In addition, serving MHP to military personnel and healthcare facilities required rigorous security controls. The lack of a scalable and secure cloud infrastructure was an obstacle to the company’s growth and kept it from entering regulated markets.

Proposed Solution & Architecture

To address these challenges and meet our client’s needs for scalability and security, we developed a cloud infrastructure based on Amazon ECS. This infrastructure provided scalable platform services and underlying virtual computing resources, and was carefully protected from external threats through the thoughtful selection and design of security components.

To ensure successful implementation, we first performed the following activities:
– Cloud Infrastructure Audit
– DevOps and DevSecOps integration strategy

The architecture of the proposed solution included the following components:
– Amazon ECS with an Auto Scaling group of EC2 instances – for the platform’s application services
– Amazon ECS on Fargate – for the execution of scheduled and on-demand data pipelines
– Amazon Aurora PostgreSQL (Amazon RDS) – a scalable and resilient database cluster
– EC2 Image Builder pipelines – for automatically producing hardened (secure) base AMIs and container images on a predefined schedule, thus meeting the regular security patching requirement
– Amazon Cognito – as the application-level Identity Provider and a point of extensibility for integration with customers’ SSO solutions and IdPs
– Amazon OpenSearch – as managed application-level document storage. A separate OpenSearch instance was deployed to act as the SIEM solution for the platform.
– AWS Secrets Manager, SSM Parameter Store – for storing application-level secrets and environment variables
– Amazon CloudWatch with OpenSearch subscriptions – as temporary storage for all account-level application and service logs, and as a source of data for CloudWatch Alarms
– AWS Security Hub – for compliance monitoring (SOC 2, HIPAA)
– Amazon GuardDuty – for intelligent threat monitoring
– CI/CD pipelines for all application and data science (ML) components, built on GitHub Actions, with deployment capability to 5 different environments: development, staging, and 3 production environments.
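As an illustration of the Secrets Manager item in the list above, the following is a minimal sketch of how an application service might read a JSON-encoded secret at startup. It is not taken from the project: the function name, the secret name `mhp/staging/db`, and the fake client are hypothetical, and the only AWS call assumed is `get_secret_value` as exposed by boto3's `secretsmanager` client.

```python
import json
from typing import Any, Dict


def load_app_secret(client: Any, secret_id: str) -> Dict[str, str]:
    """Fetch a JSON-encoded secret and return it as a dict.

    `client` is expected to behave like boto3.client("secretsmanager");
    it is passed in so the function can be exercised without AWS access.
    """
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Hypothetical stand-in for the real client, for local runs and tests."""

    def __init__(self, secrets: Dict[str, str]) -> None:
        self._secrets = secrets

    def get_secret_value(self, SecretId: str) -> Dict[str, str]:
        # Mirrors only the one response field the loader reads.
        return {"SecretString": self._secrets[SecretId]}


# Assumed secret name and payload, for illustration only.
fake_client = FakeSecretsClient(
    {"mhp/staging/db": '{"username": "app", "password": "example"}'}
)
db_credentials = load_app_secret(fake_client, "mhp/staging/db")
print(sorted(db_credentials.keys()))
```

In a real service, `fake_client` would be replaced by `boto3.client("secretsmanager")`, with the task's IAM role granting read access to that one secret.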

The infrastructure was created using the IaC (“Infrastructure as Code”) approach, following Secure SDLC best practices, and was provisioned in 3 AWS regions to ensure low-latency access for Sparta Science customers: US East (N. Virginia), US West (N. California), and Asia Pacific (Sydney).
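The combination of 5 environments and 3 regions can be pictured as a small deployment matrix. The sketch below is an assumption, not the project's actual layout: it places one production environment in each region and puts development and staging in `us-east-1`. The region codes are the standard AWS identifiers for N. Virginia, N. California (which covers the San Francisco area), and Sydney. The environment names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Standard AWS region codes for the three locations named in the text.
REGIONS = {
    "N. Virginia": "us-east-1",
    "N. California": "us-west-1",
    "Sydney": "ap-southeast-2",
}


@dataclass(frozen=True)
class Environment:
    name: str
    region: str
    production: bool


def build_environments() -> List[Environment]:
    """Return the 5 environments: 2 non-production plus 1 production per region."""
    envs = [
        # Assumption: non-production environments share a single region.
        Environment("development", REGIONS["N. Virginia"], False),
        Environment("staging", REGIONS["N. Virginia"], False),
    ]
    for code in REGIONS.values():
        envs.append(Environment(f"production-{code}", code, True))
    return envs


for env in build_environments():
    print(f"{env.name:28s} {env.region:16s} production={env.production}")
```

Keeping such a matrix as data lets the same IaC modules and CI/CD workflows be parameterized per environment instead of duplicated.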

This infrastructure allowed the engineering team to implement the application and data science components successfully. Our client also gained the ability to easily spin up and shut down environments of this complex infrastructure in different geographic regions for their customers, for both demonstration and production purposes.

Outcomes of Project & Success Metrics

– Scalable and Resilient Cloud Infrastructure Architecture
– High Availability: 95%
– Efficient Multi-environment CI/CD pipelines
– Operational Observability and Alerting
– Strong Security Posture
– Well-defined RTO/RPO
– Portable Cloud Infrastructure (IaC, Terraform)
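To put the availability figure above in concrete terms, an availability percentage can be converted into a downtime budget. The sketch below assumes the figure is measured over wall-clock time, since the source does not state the measurement window. Under that assumption, 95% over a 365-day year corresponds to about 438 hours (roughly 18 days) of permitted downtime. The function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year


def allowed_downtime_hours(availability_pct: float, period_hours: float) -> float:
    """Downtime budget, in hours, implied by an availability target over a period."""
    return period_hours * (1 - availability_pct / 100)


# The 95% figure comes from the list above; the other targets are shown
# only for comparison and are not claims about this project.
for target in (95.0, 99.0, 99.9):
    yearly = allowed_downtime_hours(target, HOURS_PER_YEAR)
    print(f"{target}% availability -> {yearly:.2f} hours of downtime per year")
```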

TCO Analysis

Initial costs: 3 FTEs for 12 months
Ongoing development: 0.5 FTE / month
Support: 1 FTE Senior DevOps + 1 Part-time DevOps

Lessons Learned

Lesson: Implementing continuous monitoring and establishing feedback loops are essential for maintaining system health and performance.

Example: The project initially lacked robust monitoring tools, leading to undetected issues that caused system downtime. Integrating monitoring solutions later helped quickly identify and resolve problems.
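One way to act on this lesson is to define alarms alongside the services they watch. The sketch below builds the parameters for a CloudWatch alarm on an ECS service's average CPU utilization. It is an illustration under assumptions: the cluster and service names, the thresholds, and the SNS topic ARN are hypothetical. The dictionary keys follow the parameters of boto3's CloudWatch `put_metric_alarm` call, and `AWS/ECS` / `CPUUtilization` are the standard ECS metric namespace and name.

```python
from typing import Any, Dict


def ecs_cpu_alarm_params(
    cluster: str, service: str, threshold_pct: float, sns_topic_arn: str
) -> Dict[str, Any]:
    """Build keyword arguments for cloudwatch.put_metric_alarm (no AWS call made)."""
    return {
        "AlarmName": f"{cluster}-{service}-cpu-high",
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,              # seconds per datapoint
        "EvaluationPeriods": 5,    # alarm after 5 consecutive breaching minutes
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "breaching",  # assumption: silence counts as a problem
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical names and ARN, for illustration only.
params = ecs_cpu_alarm_params(
    "mhp-prod", "api", 80.0, "arn:aws:sns:us-east-1:123456789012:ops-alerts"
)
print(params["AlarmName"])
```

With credentials available, the alarm would be created with `boto3.client("cloudwatch").put_metric_alarm(**params)`. Keeping the builder separate from the call makes the alarm definition easy to review and unit-test.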

Industry Vertical: Health
Use case: Backup & Recovery, Business Applications, Databases
ISV tools and technology

GitHub – source code repository
SonarQube – for source code scanning
Wazuh – as an XDR
Docker – for containerization of the app services
Terraform – for IaC