Build me a secure SaaS app

Take home assignment gone wrong

When applying for an IT engineering job, meeting a few people at the target company and answering some technical questions is standard practice. Things get a bit more interesting when you have to present a security roadmap to all the company’s technical leads.

A few months ago, I applied for a security engineering role at a French scale-up, let’s call it ACME. I went through the first couple of interviews and everything was going fine. For the third one, I was sent a case study to complete beforehand.

Your company has multiple front-end applications that expose private customer’s data.

Those applications are served by multiple back-end micro-services hosted on an AWS infrastructure using Docker and ECS, in a single region.

The applications make use of static assets served by an S3 bucket.

There’s no strong certification to be compliant with, but the main objective is to be as close as possible to industry standards (SaaS B2B, web application, distributed worldwide).

Going from scratch, propose and apply an organization’s data privacy principles on:

  • Application level
    • Authentication mechanism
    • Configuration and secrets management strategy
    • Logging strategy
  • Service to service communication
    • Apply end to end encryption
  • Customer data management
    • Encryption at-rest and how to implement it on all layers
  • Network infrastructure
    • Cloud network architecture
    • Offices network architecture
  • IT acceptable use policy
  • Physical facilities access policy
  • Incident response management on data breach

One possible output would be a presentation (15 minutes max) with a few slides on the aforementioned topics to explain the choices you made (diagrams are useful too). The goal is to share guidelines with ACME’s team in order to prepare for implementation next year! Giving context and arguments will also help improve the company’s security culture.

Feel free to ask as many questions as you want to understand the problem and let’s schedule a call in the next few days for you to answer these questions.

This “case” covers a very wide range of topics and goes into the technical nitty-gritty of building a cloud-based SaaS app’s security paradigm. Preparing an answer takes much more than the 2 hours budgeted by ACME, and the answer definitely won’t fit in a 15-minute presentation.

It took me 2 days to prepare the slide deck, so I’m sharing my work with you.

ℹ️ Disclaimer. Each organization has specific requirements and resources, so a one-size-fits-all approach will not work.

The bullet points below are a collection of best practices that are broadly applicable to a software service delivery infrastructure running in a cloud environment. This is NOT a blueprint of what should be implemented in your organization. It does not take any resource constraints into consideration and tries to cover as many topics as possible.

If you are looking to replicate the suggestions made in this post in your organization, remember that security is a team sport: one person cannot cover all of these issues.

And finally, this was written in May 2020. If new products, tools, or best practices are released, or if you’re an expert in a particular subdomain and have ideas on improving a company’s security, do share them with me so I can update this post.

Now, without further ado, let’s dive right in.

Overview

Improving an SMB’s security stance requires sustained efforts on multiple fronts

GRC, keeping on top of

  • Legal obligations
  • Compliance obligations
  • Contractual obligations

Tech-wise

  • Keeping up with quickly changing technologies
  • Implementing GRC requirements
  • Reducing overhead due to security measures and improving workflow integration

IT security is not a state, it’s a discipline

Starting with GRC

1. Identify regulatory, contractual and compliance requirements. You might need to comply with

  • GDPR
  • Legislation applicable to ACME’s business environment
  • Security standards (ISO27001/17/18, SOC1/2/3, CSA Star, HDS etc.)
  • Contractual requirements are highly variable depending on the client, but usually have a common set of rules

2. Risk assessment

  • Given the GRC context, identify risks weighing on your business and data
  • Your clients might ask you for this

3. Write everything down (non-exhaustive list)

  • Information Systems Security Policy sets security requirements (availability, integrity, confidentiality, proof) and how to implement them
  • Business Continuity Plan
  • Personal Data Security Policy sets personal data handling rules within ACME
  • Privacy Statement clarifies ACME’s handling of personal data to clients and their respective clients if necessary
  • IT Acceptable Use Policy

IT Acceptable Use Policy

  • Signed by employees as they join ACME
  • Separate from the Information Systems Security Policy
  • Sets IT resource usage rules for ACME employees
  • Sets simple rules, including some simple security rules applying to all employees
    • “Don’t use company resources for illegal/immoral activities”
    • “Don’t disclose company data”
    • “Don’t store company/client data on the USB stick you give to your son/daughter”
    • “Don’t hack us”

Incident response process

  • Assess
    • Understand what happened, extent of damage
    • If necessary, set up crisis management unit (CISO, DPO, Account Manager, IT experts, management, etc.)
    • Determine next steps
      • Call clients?
      • (Have clients) call Data Protection authority?
      • External help? (legal, IT forensics, etc.)
      • Save/dump data for legal proof/later forensics
  • Contain
    • Limit/stop damage within business requirements (availability, data loss etc.)
  • Remediate
    • Identify root causes and possible solutions
    • Implement solutions and monitor effectiveness
  • Improve

Physical spaces

  • All services (internal and client facing) are hosted on the cloud
  • No data, source code, or infrastructure hosted on premises
  • BeyondCorp zero-trust approach: the office LAN only provides internet access
  • Visitors allowed only when identified and accompanied
  • 24/7 surveillance, alarm system, and on-site security for employees’ physical safety

Miscellaneous

Ensure

  • The DPO is involved in the development of new features that impact data privacy
  • Personal data is labelled as such, wherever it may be stored on the IT infrastructure
  • Personal data has an expiration date beyond which it’s automatically deleted
  • Security audits are carried out regularly
    • Annually
    • Clients will ask for the latest audit/pentest results
  • Employees’ access accounts are adequately managed (and revoked)
  • Slack’s free tier is not used, because it’s a major compliance risk
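
The expiration rule for personal data can be sketched as a periodic job that compares each record's age against a retention period. This is only an illustration: the `Record` fields and the 365-day default are assumptions of mine, and the real retention period should come from your DPO and contracts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Record:
    record_id: str
    created_at: datetime
    is_personal_data: bool  # hypothetical label, set wherever the record is stored


def expired_records(records, now, retention=timedelta(days=365)):
    """Return the labelled personal-data records older than the retention period."""
    return [
        r for r in records
        if r.is_personal_data and now - r.created_at > retention
    ]
```

A scheduled task could call `expired_records(...)` with `datetime.now(timezone.utc)` and delete whatever it returns. This only works if personal data is labelled consistently in the first place.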

Networking

AWS Account architecture

  • Production account
    • Production VPC and assets
  • Development account
    • Staging VPC and assets
    • Separate demo environment
    • Dev sandbox VPC
  • Security/logs account
    • Used to store logs from all accounts
  • Master account

Static content

  • Publicly accessible S3 bucket serves assets and the front-end framework
  • CloudFront CDN to reduce worldwide latency for static assets
  • Be careful not to store any sensitive information here, since the bucket is public

Production VPC

  • Every subnet has its own default SG and NACL
  • Add endpoints depending on subnet use (S3, STS, SQS, RDS etc.)
  • Accessing AWS console, APIs and resources requires VPN + MFA
  • Staging VPC in separate account deployed identically thanks to Terraform

Production VPC simple diagram

 +---------Internet GW-----------VPN TGW-------------------------------+
 |              |                    |                                 |
 |         LB / API GW               |                                 |
 |              +--------------------)-----------------+               |
 |              |                    |                 |               |
 |  +----------------------+  +-----------+  +----------------------+  |
 |  |                      |  |           |  |                      |  |
 |  |   Public subnet      |  |           |  |    Public subnet     |  |
 |  |                      |  |           |  |                      |  |
 |  +----------------------+  |           |  +----------------------+  |
 |                            |  Bastion  |                            |
 |  +----------------------+  |   EC2     |  +----------------------+  |
 |  |                      |  |           |  |                      |  |
 |  |   Private (apps)     |  | Admin     |  |    Private (apps)    |  |
 |  |                      |  | Subnet    |  |                      |  |
 |  +----------------------+  +-----------+  +----------------------+  |
 |                                                                     |
 |  +----------------------+  +-----------+  +----------------------+  |
 |  |                      |  |           |  |                      |  |
 |  |   Private (data)     |  |           |  |    Private (data)    |  |
 |  |                      |  |           |  |                      |  |
 |  +----------------------+  |           |  +----------------------+  |
 |                            |           |                            |
 |  +----------------------+  |           |  +----------------------+  |
 |  |                      |  |           |  |                      |  |
 |  |   Protected (data)   |  |Monitoring |  |    Protected (data)  |  |
 |  |   Optional           |  | Subnet    |  |    Optional          |  |
 |  +----------------------+  +-----------+  +----------------------+  |
 |        AZ 1                                       AZ 2              |
 |                              AWS Region                             |
 +---------------------------------------------------------------------+
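
To make the diagram more concrete, here is a minimal sketch of how the address space could be carved up, with one subnet per tier and AZ plus the admin and monitoring subnets. The `10.0.0.0/16` range, the `/20` subnet size and the subnet names are my own assumptions, not part of the original design. In practice this would live in Terraform rather than Python.

```python
import ipaddress

# Tiers and AZs as drawn in the diagram; the names are illustrative.
TIERS = ["public", "private-apps", "private-data", "protected-data"]
AZS = ["az1", "az2"]


def plan_subnets(vpc_cidr="10.0.0.0/16", new_prefix=20):
    """Assign one non-overlapping block per (tier, AZ) pair, plus admin and monitoring."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    names = [f"{tier}-{az}" for tier in TIERS for az in AZS] + ["admin", "monitoring"]
    return {name: next(blocks) for name in names}


if __name__ == "__main__":
    for name, block in plan_subnets().items():
        print(f"{name:20} {block}")
```

Keeping every tier in its own block makes the per-subnet SG and NACL rules easier to write and to audit.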

DDoS

  • DDoS protection is necessary to avoid a service outage from even a small attack
  • Providers include AWS, Cloudflare, OVH, 6Cure
  • Different technologies for different use cases

WAF

Offices

  • No LAN resources, just internet access
  • AWS service delivery infrastructure accessible via VPN
  • Business applications, IM & file sharing -> Accessible online with ACME SSO account
  • On-site network security necessary
    • Preventing arpspoof, rogue DHCPs, VLAN hopping and other network attacks with modern network equipment
    • On-site recursive DNS server and monitoring outbound DNS requests
    • Monitor FW
    • On-site NIDS Suricata (on a dedicated VLAN)
    • Everything routes back to corporate SIEM via a VPN
    • Going one step further: per-user VLAN & ACME LDAP-based 802.1x

Authentication

Admins

  • LDAP+SSO of some sort (GSuite LDAP, Okta etc)
  • Long passwords
  • VPN + MFA to access production with STS
  • Passwords management solution for all ACME employees, eg

Users

  • ACME apps must provide logins through
    • built-in accounts
      • Long passwords
      • MFA if the client wants it
    • SSO integration with clients’ Directories (client handles authentication)
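
For the built-in accounts, MFA is commonly implemented with time-based one-time passwords (TOTP, RFC 6238), which is what most authenticator apps speak. The post does not prescribe a specific MFA method, so treat this as one possible option. The sketch below uses only the standard library; in production you would more likely rely on the security framework you already use.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226)."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    """Time-based one-time password (RFC 6238, SHA-1 variant)."""
    now = time.time() if at is None else at
    return hotp(secret, int(now // step), digits)


def verify_totp(secret: bytes, submitted: str, at=None, step: int = 30) -> bool:
    """Compare in constant time to avoid leaking how many digits matched."""
    expected = totp(secret, at, step)
    return hmac.compare_digest(expected.encode(), submitted.encode())
```

A real deployment also needs rate limiting, a small tolerance window for clock drift, and encrypted storage for the shared secrets. None of that is shown here.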

Securing data end to end

  • Users connect to the internet facing API Gateway (reverse proxy) with TLS (with strong cipher suites)
  • Internal PKI allows inter-service TLS encrypted communication with mutual auth for synchronous HTTP calls
  • Asynchronous calls (RabbitMQ, SQS etc) can also be encrypted in transit and at rest
  • Encryption certs/keys management must be automated
  • Databases (RDS) can also encrypt data (storage encryption)
  • Database encryption (non-storage) may carry a performance penalty
  • S3 stored objects can be encrypted server-side with S3-managed keys (SSE-S3) or with a KMS CMK (SSE-KMS)
  • EBS & EFS volumes (for EC2s or ECS) must be encrypted (eg. with KMS-managed keys)
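
As an illustration of the inter-service TLS with mutual auth mentioned above, this is roughly what the server side could look like with Python's standard `ssl` module. The function name and file parameters are hypothetical. The cipher suites are left to the defaults chosen by `ssl.create_default_context`, which you may want to tighten further.

```python
import ssl


def mutual_tls_server_context(cert_file=None, key_file=None, ca_file=None):
    """Server-side TLS context that requires a client certificate (mutual auth)."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse legacy protocol versions
    ctx.verify_mode = ssl.CERT_REQUIRED           # clients must present a certificate
    if ca_file:
        # Trust only the internal PKI root, not the public CA bundle.
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

The certificate and key files would be issued and rotated automatically by the internal PKI, which is why the bullet on automated cert management matters.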

Logs

ACME’s apps’ logs

  • Consistent structure across all apps and logs
    • Use precise timestamps (ntp sync’d)
    • Error IDs
    • JSON format
  • Log levels (INFO, DEBUG, WARN, ERROR)
  • Human (and non dev) readable description is a must
  • No personal data in logs thanks to developer awareness
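
A consistent log structure is easier to enforce with a shared formatter than with a convention. Here is a minimal sketch using the standard `logging` module. The key names (`timestamp`, `level`, `service`, `error_id`, `message`) are my assumptions, so adapt them to whatever schema your SIEM expects.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a fixed set of keys."""

    def format(self, record):
        entry = {
            # UTC timestamp; accuracy depends on the host being NTP-synced
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": record.name,
            "error_id": getattr(record, "error_id", None),
            "message": record.getMessage(),  # human-readable description
        }
        return json.dumps(entry)


def get_logger(service):
    """Return a logger for the given service that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```

Usage would look like `get_logger("billing").error("Payment provider unreachable", extra={"error_id": "E1042"})`, where the service name and error ID are made up for the example.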

Infrastructure Logs

  • Log all the things to S3 in the log account. Use SSE & lifecycles
  • Main AWS services:
    • CloudTrail
    • CloudWatch Logs
    • Config
    • ELBs
    • Lambdas
    • etc.
  • Use a log collector (eg. fluentd or a syslog server) for
    • EC2
    • Containers
  • Channel all the logs to a SIEM (eg. Splunk, ELK SIEM) (see previous blog post)
    • Configure SIEM alerts & dashboards

SIEM architecture

  • Collection / Indexing / Viewing
  • Collectors close to the resources
  • Indexers are clustered, replicated and fault tolerant
    • Data stored persistently, replicated across AZ
    • Indexing machines part of immutable architecture
    • Indexer location must comply with GRC requirements
  • Viewers (eg. Kibana, or Splunk Search Heads) replicated and fault tolerant
    • Alerts are vital for a SOC to function properly

Backups & BCP

As part of the Business Continuity Plan / Disaster Recovery Plan

  • RDS
    • RDS backed-up to multiple different AZ, or regions
    • Test RDS backup restoration in the staging environment
  • S3
    • S3 is highly fault tolerant, can be backed up to Glacier if really necessary (data lifecycles still apply)
  • EBS / EFS
  • Code and machines
    • Use immutable infrastructure with Terraform/CloudFormation, Ansible/Saltstack, Packer etc.
      • Machine images (AMI, docker registry images) can be easily redeployed
    • Save code and config files on a version control service
      • If self hosted, back it up to external location
      • If SaaS, ensure contract is acceptable, and still back it up to an external location

Platform security

Security tools in software production chain

  • Use a reputable security framework to handle security sensitive workloads like RBAC, authentication/2FA etc (eg. Spring Security, Yosai etc.)
  • Use SonarQube and select security rules/plugins to check code security, dependencies and configurations
  • Include security libraries/dependencies if applicable (eg. Sqreen)
  • Use Anchore / Clair for container image static analysis
  • Run Nessus and additional fuzzers on staging environment
  • Sign Docker images with Docker Content Trust (DCT) if the registry is untrusted
  • Plug all security tooling reports into the SIEM

Tight RBAC

  • Use least-privilege RBAC everywhere, eg:
    • AWS IAM for admins, services
    • Bucket policies, KMS policies etc
    • On the network level
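
To show what least privilege means for an IAM policy, here is a small sketch that generates a policy document allowing read access to a single prefix of a single bucket and nothing else. The function name, bucket and prefix are hypothetical. The document structure follows the standard IAM JSON policy format.

```python
import json


def read_only_bucket_policy(bucket, prefix):
    """Build an IAM policy document granting read access to one prefix of one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket and prefix, for illustration only.
    print(json.dumps(read_only_bucket_policy("acme-reports", "exports"), indent=2))
```

The point is what is absent: no wildcard actions, no `"Resource": "*"`, and no write permissions. In a real setup the same document would be declared in Terraform alongside the role it attaches to.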

Configuration management

  • Use Git
  • Use Terraform to deploy AWS resources
  • Use Ansible and Packer to apply configurations to machines and prepare AMIs
  • Use a CI (eg. Gitlab CI) to build docker images
  • Use AWS Config to check what’s deployed against what’s architected
  • Use AWS Nuke to destroy resources in the dev sandbox and reduce cost / exposure

Secrets management

  • Mozilla Sops to encrypt secrets in git files with KMS or GPG
    • Set RBAC with IAM and KMS
    • Use Terraform Sops provider to automatically decrypt secrets on TF apply
  • Ansible Vault to encrypt secrets in git files
    • Ansible automatically unvaults its secrets on playbook run
  • Generic/service accounts are necessary
    • For admins, use central secret management solution (eg. bitwarden, dashlane)
    • For machines, use Vault
    • Use automatic secret rotation when possible
  • People share secrets in instant messages
    • Don’t use Slack’s free tier, prefer a self-hosted Mattermost instance
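
Since secrets do end up in chat messages and commits, a basic pattern scan can complement the tools above. It does not replace them. The sketch below only knows two patterns (long-term AWS access key IDs and PEM private key headers) and the function name is my own. Dedicated secret scanners cover far more cases.

```python
import re

# Long-term AWS access key IDs start with "AKIA" followed by 16 uppercase
# letters or digits; the PEM header marks a pasted private key.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def find_secrets(text):
    """Return the sorted names of the patterns that match anywhere in the text."""
    return sorted(
        name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)
    )
```

Such a check could run as a pre-commit hook or on a chat export, with matches reported to the SIEM like any other security tooling output.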

Bastion

  • Set up a bastion host for all access to production and staging environments
    • No resource (EC2, RDS etc) can be SSH’d into without going through the bastion
  • Log all admin actions to the SIEM
  • Automatically block suspicious actions

Security watch

OSINT

  • Continuously scan the open web for misuse of ACME information and identity
  • Detect “shadow” IT assets
  • Detect potential scammers, phishers using ACME-looking traps
  • Spiderfoot is a useful example

Security newsfeeds

  • Information about the latest attacks and vulnerabilities
    • Useful for keeping on top of CVEs and updates
    • Useful for keeping up-to-date with changing threat landscape

Threat Intelligence

  • Use open TI feeds to supply information to an in-house TI platform

Intrusion Prevention Systems

Going one step further

  • Include OSSEC (a host-based IDS with active response) on all EC2s and run an OSSEC master
    • Useful for file integrity checking, alerting, and automated response
  • Include netsniff-ng on all EC2s, and route the traffic through an encrypted connection to a Suricata (NIDS) server
  • HIDS for docker containers
    • Monitor the host with OSSEC or Falco (doesn’t work with Fargate)
    • Can run in a sidecar container (eg. Falco)
  • Must route back to the SIEM
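
File integrity checking, one of the things OSSEC is useful for, boils down to comparing file digests against a known-good baseline. The sketch below shows the idea with SHA-256. It is not how OSSEC is implemented or configured, and the function names are mine.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each file path to the SHA-256 digest of its current content."""
    return {str(p): hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths}


def changed_files(baseline, current):
    """Return paths whose digest differs from the baseline, or that appeared or vanished."""
    return sorted(
        path for path in baseline.keys() | current.keys()
        if baseline.get(path) != current.get(path)
    )
```

The baseline would be taken when the image is built and stored off-host, so that an attacker who modifies a file cannot also rewrite the reference digests.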

Final thoughts

  • IT Security is the discipline of understanding your weak points and shielding them
  • These tools, if implemented, need to be used for the security stance to actually improve
  • It’s necessarily a company-wide effort, requiring awareness and involvement from
    • Management
    • Engineering
    • Business operations