In this system design architecture post we will design a payments system. Payments systems are found across the internet for (1) maintaining a ledger of accounts, balances, and transactions; and (2) the processing of financial transactions between individuals, businesses, and banks. Though simple on the surface to customers, payment systems are complex with many failure scenarios, edge cases, and critical customer and business impact if things go wrong. These systems are critical to each party’s financial interests and their trust in a software platform.
When designing such a complex system where data consistency is of utmost importance, we will emphasize designing for (1) data consistency and durability, (2) double-entry accounting (every transaction between two parties is zero-sum), (3) idempotency and exactly-once processing, and (4) immutability.
Design Principles:
- Data Consistency and Durability
- Double-Entry Accounting
- Idempotency and Exactly Once Processing
- Immutability
Let’s dive into user and technical requirements, followed by high-level design, user and data flows, persistence, performance, security, and edge cases.
Requirements
There are multiple areas of concern to consider when detailing the user and technical requirements. We will need to manage billable accounts, maintain an authoritative ledger, run a risk analysis engine to detect and block fraudulent activity, and provide a payment processing gateway that encapsulates various payment processors and preserves availability if one of them is unavailable. As many businesses do, we will offload payment processing to third-party services that specialize in these bank transactions and already have established relationships and integrations in place. This reduces our security and compliance requirements as well. As a platform, our payments system will hold a wallet for each user and support both a pay-in flow to add funds and a pay-out flow to receive funds. We will support real-time and scheduled payments. Let’s take a closer look at these business requirements.
User Requirements:
- Extremely high durability (99.9999%) and strong consistency; no data loss or inconsistency is acceptable
- Scalable to 5 million transactions daily (~50 transactions per second)
- Support for multiple payment options - bank checking account, credit card, PayPal, and Apple Pay / Google Pay
- Support for multiple plug-and-play third-party payment processing integrations
- Support both for submitting payments to the platform and receiving payments from the platform
- Support for payments triggered by user request or using automated scheduled payments
Out-of-Scope:
- Currency exchange
- Taxes
- Analytics and data processing (see Metrics post)
- Multi-region and disaster recovery
Given we require scaling this solution to 5 million transactions a day, let’s evaluate what the throughput requirement is for each key access pattern. We will assume a 2-to-1 ratio of requests viewing funds and past payments versus submitting new payments, whether pay-in or pay-out. 5 million requests per day averages out to roughly 58 requests per second (5,000,000 / 86,400 seconds), which we will round to about 50 transactions per second (TPS) for back-of-the-envelope sizing. This is not an overly massive volume of traffic and should be easily achievable with proper design.
Access Patterns:
- View funds and past payments (33.3 TPS, read)
- Submit payment, either pay-in or pay-out (16.7 TPS, write)
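The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the 2-to-1 split is applied to the rounded 50 TPS figure used in this post rather than to the exact daily average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_tps(requests_per_day: int) -> float:
    """Average requests per second, assuming traffic is spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


def split_by_ratio(total_tps: float, reads: int = 2, writes: int = 1):
    """Split a total rate into (read_tps, write_tps) using a reads-to-writes ratio."""
    parts = reads + writes
    return total_tps * reads / parts, total_tps * writes / parts


exact_tps = average_tps(5_000_000)          # exact daily average, a bit under 58
read_tps, write_tps = split_by_ratio(50.0)  # the rounded figure used in this post

print(f"exact average: {exact_tps:.1f} TPS")
print(f"reads: {read_tps:.1f} TPS, writes: {write_tps:.1f} TPS")
```

Real traffic is rarely uniform, so peak load should be sized with some headroom above these averages.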
High Level Design
With those requirements in mind, let’s consider the components and services we will need to design. All the services within this payments system will fall under the Payments domain. Following domain-driven design and cellular architecture, we expose one service from the Payments domain, the Payment Platform Service. Internally the Payment Platform Service will integrate with multiple internal services, each with a separate area of responsibility. This allows the services to be deployed independently with isolated availability, while the internal concepts are abstracted away from clients external to the Payments domain.
Components:
- [External] Edge Router - network firewall, DDoS protection, authentication, block list
- [External] Payment Platform Service - external APIs supporting payments between parties
POST:/payments/payment/v1
- submits a new payment request, either pay-in or pay-out
- includes checkout UUID from the front end for idempotency
- { payment_id: string, user_id: string, from_acct: string, to_acct: string, amount: string, currency: string, payment_option: string }
GET:/payments/payment/v1/{uuid}
- provides a payment request status and metadata
POST:/payments/scheduled-payment/v1
- submits a new scheduled payment with frequency
GET:/payments/scheduled-payment/v1
- provides a list of scheduled payments, either pay-in or pay-out
PUT:/payments/scheduled-payment/v1/{uuid}
- updates a scheduled payment
DELETE:/payments/scheduled-payment/v1/{uuid}
- deletes a scheduled payment
- [Internal] Risk Engine - rules-based service for evaluating risk-level of a payment request
- [Internal] Payment Processing Service - abstraction layer over individual pluggable payment processing integrations, responsible for processing a single payment
- [Internal] Payment Scheduler - service for triggering automated payments on configurable frequencies
- [Internal] Event Log - write-ahead append-only log of events
- [Internal] Ledger - source of truth for transactions using double-entry accounting
- [Internal] Wallet - maintains account balances for parties
- [Internal] Reconciliation Processor - reconciles PSP settlements against internal ledger
- [3rd-Party] Payment Service Providers (PSP) - third-party services for processing payments
With the high-level components and services laid out, let’s dive into each user flow, the components involved, and the sequence of business logic. We will then proceed to low-level design and dive into persistence, performance, and edge cases.
Flows:
- Pay-In (Submit Funds)
- Pay-Out (Receive Funds)
- Risk Engine and Fraud Detection
- Reconciliation
Pay-In (Submit Funds)
The first user flow we will go over supports submitting payments to the platform (pay-ins) via a user payment-submission trigger. This flow begins when a user clicks the “pay” or “checkout” button. The frontend web application holds a CSRF token to prevent cross-site request forgery and a checkout UUID for idempotency, both of which are sent in the network request. The request is sent from the frontend to the Payment Platform Service through the Edge Router. The Edge Router protects platform services from bad actors and Distributed Denial of Service (DDoS) attacks through a web application firewall (WAF) and block list. The Edge Router also authenticates the user submitting the request and injects user-based headers. Once the request reaches the Payment Platform Service, the checkout UUID is validated by the Payment Platform Service to ensure exactly-once processing and the request is entered into the Event Log. From the Payment Platform Service, one or more payment requests are sent to the Payment Processing Service. Alternatively, a scheduled payment can be submitted and stored in the Payment Scheduler Service for later processing.
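The checkout UUID check can be sketched as follows. This is a minimal in-memory illustration, not the service's actual implementation: `IdempotentHandler`, `PaymentResult`, and the status string are hypothetical names, and a real service would typically enforce uniqueness with a database constraint rather than a Python dictionary.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class PaymentResult:
    checkout_id: str
    status: str


class IdempotentHandler:
    """Processes each checkout UUID at most once and remembers the outcome."""

    def __init__(self, process: Callable[[str], PaymentResult]) -> None:
        self._process = process
        self._results: Dict[str, PaymentResult] = {}
        self._lock = threading.Lock()

    def submit(self, checkout_id: str) -> Tuple[PaymentResult, bool]:
        """Return (result, is_duplicate) for the given checkout UUID."""
        with self._lock:
            if checkout_id in self._results:
                # Seen before: do not process again, hand back the stored result.
                return self._results[checkout_id], True
            result = self._process(checkout_id)
            self._results[checkout_id] = result
            return result, False


handler = IdempotentHandler(lambda cid: PaymentResult(cid, "ACCEPTED"))
first, first_is_dup = handler.submit("3f2c9a50-0000-4000-8000-000000000001")
second, second_is_dup = handler.submit("3f2c9a50-0000-4000-8000-000000000001")
```

Holding the lock while processing keeps the sketch simple; it also means a slow payment blocks other submissions, which a production design would avoid.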
When the Payment Processing Service receives each payment request, it stores the request in its own Payments database to track the status. One of several pluggable Payment Service Providers (PSPs) is then called based on prioritization rules, location, time, availability, cost, and other factors. These requests include a payment UUID for idempotency to ensure repeat requests do not result in repeat transactions and overbilling, with the payment value encoded as a string to eliminate precision and overflow issues. PSPs are integrated in the frontend with an embedded iframe or a popup where the user either enters their confidential payment information directly or logs into an existing PSP-linked account. By not storing confidential payment information or connecting directly to banking infrastructure, we eliminate numerous compliance requirements and reduce our exposure to security breaches.
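To illustrate why the amount travels as a string, here is a small sketch using Python's `decimal` module. The `parse_amount` helper and its two-decimal limit are assumptions made for this example, not part of the design above.

```python
from decimal import Decimal, InvalidOperation


def parse_amount(raw: str, max_decimals: int = 2) -> Decimal:
    """Parse a string amount into an exact Decimal, rejecting malformed input."""
    try:
        value = Decimal(raw)
    except InvalidOperation as exc:
        raise ValueError(f"not a decimal amount: {raw!r}") from exc
    if not value.is_finite() or value <= 0:
        raise ValueError(f"amount must be a positive finite number: {raw!r}")
    if -value.as_tuple().exponent > max_decimals:
        raise ValueError(f"too many decimal places: {raw!r}")
    return value


# Binary floats cannot represent most decimal fractions exactly.
float_sum_is_exact = (0.1 + 0.2 == 0.3)

# Decimals built from strings keep the exact value that was sent.
decimal_sum_is_exact = (parse_amount("0.10") + parse_amount("0.20") == Decimal("0.30"))
```

Currencies differ in how many decimal places they use, so a real service would look up `max_decimals` per currency rather than hard-coding it.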
PSPs may be integrated synchronously or asynchronously. For synchronous integrations the Payment Processing Service simply waits for the PSP HTTP response. For asynchronous integrations, PSPs may require polling or may support a service-provided webhook. For the former, the PSP first responds synchronously with a UUID which we then poll for status updates. For the latter, in the PSP request we provide a callback URL (“webhook”) which the PSP will POST to once the payment is completed. Once the PSP completes processing the payment, the Payment Processing Service updates its own database. In the case of failure, based on the error code, we retry via a retry queue with an exponential-backoff retry interval; for repeated failures we send the request to a dead-letter queue for later automated reconciliation or manual review. Depending on the error code, we may open the circuit breaker for that PSP to prevent additional requests while the PSP is unavailable. In the case of success, the Payment Processing Service returns the result to the Payment Platform Service.
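The retry interval and circuit breaker described above might look roughly like this. It is a simplified sketch: the thresholds, delays, and class name are illustrative assumptions, and production retry logic usually adds random jitter to the delays, which is left out here to keep the output deterministic.

```python
import time
from typing import Callable, List, Optional


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Delay in seconds before each retry: base, 2*base, 4*base, ... up to cap."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


class CircuitBreaker:
    """Stops calls to a PSP after repeated failures, then allows a probe later."""

    def __init__(self, failure_threshold: int = 3, reset_after: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: only let a request through once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self._reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()


delays = backoff_delays(4)
```

When `allow_request()` returns False for one PSP, the caller can route the payment to an alternative PSP instead of waiting.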
Upon completion of payment from the Payment Processing Service, the Payment Platform Service updates each user’s wallet with the successful payments, as well as the Ledger. The Ledger uses double-entry accounting best practices where every transaction is net-zero with the same amount deducted from account 1 as added to account 2 and is immutable once written to. Once the entire workflow completes, a response is sent to the frontend to redirect the user to a successful payment page.
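A minimal sketch of a double-entry ledger shows the net-zero and immutability properties. The class and field names are hypothetical, and the list stands in for an append-only table in the real system.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Tuple


@dataclass(frozen=True)  # frozen: an entry cannot be changed after creation
class Entry:
    payment_id: str
    account: str
    amount: Decimal  # negative for the paying account, positive for the receiving one


class Ledger:
    """Append-only ledger where every transfer writes two offsetting entries."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def record_transfer(self, payment_id: str, from_acct: str,
                        to_acct: str, amount: Decimal) -> None:
        if amount <= 0:
            raise ValueError("transfer amount must be positive")
        self._entries.append(Entry(payment_id, from_acct, -amount))
        self._entries.append(Entry(payment_id, to_acct, amount))

    def balance(self, account: str) -> Decimal:
        return sum((e.amount for e in self._entries if e.account == account),
                   Decimal("0"))

    def total(self) -> Decimal:
        """Sum over all entries; stays at zero if the books are consistent."""
        return sum((e.amount for e in self._entries), Decimal("0"))

    def entries(self) -> Tuple[Entry, ...]:
        return tuple(self._entries)


ledger = Ledger()
ledger.record_transfer("pay-1", "acct-buyer", "acct-seller", Decimal("25.00"))
```

Because entries are never edited, a correction is recorded as a new transfer in the opposite direction rather than as an update.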
Pay-Out (Receive Funds)
The second user flow we will go over supports receiving payments from the platform (pay-outs). Similar to the pay-in user flow, user triggered pay-out requests pass through the Edge Router and into the Payment Platform Service or can be scheduled in advance via the Payment Scheduler. Upon receiving the pay-out request, the Payment Platform Service logs the request in the event log and calls the Payment Processing Service for each payment which subsequently calls one or more Payment Service Providers (PSP). When the PSP completes processing the payment, the Payment Platform Service is updated along with the frontend application.
Oftentimes pay-outs are not submitted in real-time but are submitted on a set schedule. These automated scheduled payments are supported through automated workflows which we will now go over. Pay-ins and pay-outs may be automated by users by submitting account details and frequency. The Payment Scheduler is responsible for configuring scheduled payments and then executing them accordingly. A database stores each scheduled payment configuration and a cron job executes these based on their scheduled frequencies at up to a 1-hour granularity. Within the one hour window we can distribute payment requests as needed to support scalability. When a scheduled payment reaches its next interval the Payment Scheduler submits a payment request to the Payment Processing Service via the Payment Platform Service, similar to when issuing user-triggered payment requests. Similarly, the PSP response is stored in Payment Processing Service with Event Log, Ledger, and Wallets updated.
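One way to distribute scheduled payments within the one-hour window is to derive a stable offset from each schedule's identifier. This is an assumed approach for illustration; the post does not prescribe how the spreading is done, and the function names are hypothetical.

```python
import hashlib
from datetime import datetime, timedelta

WINDOW_SECONDS = 3600  # the 1-hour scheduling granularity


def offset_in_window(schedule_id: str, window_seconds: int = WINDOW_SECONDS) -> int:
    """Map a schedule ID to a stable offset (in seconds) inside the window."""
    digest = hashlib.sha256(schedule_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % window_seconds


def due_time(window_start: datetime, schedule_id: str) -> datetime:
    """Time within the window at which this scheduled payment is submitted."""
    return window_start + timedelta(seconds=offset_in_window(schedule_id))


start = datetime(2024, 1, 1, 9, 0, 0)
run_at = due_time(start, "scheduled-payment-123")
```

Hashing keeps the offset the same on every run, so a given scheduled payment fires at a predictable point in each window while the overall load is spread across the hour.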
Risk Engine and Fraud Detection
In any payments solution, fraud detection is required to minimize fraudulent activity. Advanced systems may use machine learning, training models on past transactions and then running inference against those models as payment requests are received. In addition to ML, we can use a rules engine to configure business rules and guardrails. Each rule in this component has 3 elements: (1) an event trigger (when the rule will be evaluated), (2) a set of logical conditions or predicates (such as specific properties with guardrails), and (3) actions to take after evaluation. The rules will follow a predefined syntax so that, when parsed, we can turn them into an abstract syntax tree (AST) for evaluation.
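The three-part rule structure can be sketched as below. The parsing step is skipped: each condition is already a (field, operator, value) tuple, standing in for a node of the parsed AST. The rule names, trigger names, thresholds, and action strings are invented for the example.

```python
import operator
from dataclasses import dataclass
from decimal import Decimal
from typing import Any, Dict, List, Tuple

# Supported comparison operators for rule conditions.
OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}

Condition = Tuple[str, str, Any]  # (event field, operator symbol, comparison value)


@dataclass(frozen=True)
class Rule:
    name: str
    trigger: str                       # (1) when the rule is evaluated
    conditions: Tuple[Condition, ...]  # (2) predicates, all of which must hold
    action: str                        # (3) what to do when the rule matches


def evaluate(rules: List[Rule], trigger: str, event: Dict[str, Any]) -> List[str]:
    """Return the actions of every rule that fires for this trigger and event."""
    actions = []
    for rule in rules:
        if rule.trigger != trigger:
            continue
        if all(OPS[op](event[field], value) for field, op, value in rule.conditions):
            actions.append(rule.action)
    return actions


EXAMPLE_RULES = [
    Rule("large-amount", "payment.submitted",
         (("amount", ">", Decimal("10000")),), "BLOCK"),
    Rule("new-account-spend", "payment.submitted",
         (("account_age_days", "<", 7), ("amount", ">=", Decimal("500"))), "REVIEW"),
]

example_event = {"amount": Decimal("750"), "account_age_days": 2}
example_actions = evaluate(EXAMPLE_RULES, "payment.submitted", example_event)
```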
Machine Learning Models:
For the ML science models, we need to consider: data ingestion, data preprocessing, training of the model, and inference against the trained model. The first step entails sourcing data from multiple locations including data lakes or data warehouses. The second step involves transforming, optimizing, and segmenting the data to be used by the model for both training and model validation. The third step takes the preprocessed data and trains the model either in (1A) batches or (1B) continuously. Lastly, our service leverages the trained model by either (2A) having the model run offline at a set cadence and pulling the results asynchronously, or (2B) querying the model as online-inference and passing in request metadata. For this solution, we will use the former option of training the model offline as a batch and the latter option of online-inference using request inputs. Each week we will retrain the model offline with the most recent data and then as requests come in we query the model to determine the risk level of a payment request.
Reconciliation
Finally, we will discuss the reconciliation of failed payments and inconsistent workflow states. Given the number of systems involved and the typically asynchronous nature of third-party payment processor integrations, we have to carefully consider each failure scenario and how we reconcile and resolve it. At every step within each of the above workflows there are multiple failure scenarios to think through. For example, the Payment Processing Service could break, the PSP may time out, the PSP may return a status which is not captured, the write to a user’s Wallet or to the Ledger may fail, and so on.
To address each of the failure scenarios, we will build a workflow which runs every hour, comparing the states in each internal service to ensure consistency. Further, every night we will receive a settlement document from each PSP with all transactions we submitted and their statuses, which will also be used for reconciliation. When an inconsistency is identified, one of three actions may be pursued by the workflow depending on the inconsistency. If it is (1) an expected category of inconsistency and has an automated solution, the reconciliation service will automatically resolve the inconsistency. This is the most desirable resolution. If it is (2) an expected category of inconsistency with no automated solution, or (3) an unexpected category of inconsistency, the inconsistency is sent to a queue for human review. We will aim to minimize actions two and three, sticking with automated reconciliation wherever possible.
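A simplified sketch of the settlement comparison follows. The status strings and the mapping from mismatch to category are assumptions chosen for the example; which inconsistencies count as "expected" would be decided per PSP in practice.

```python
from enum import Enum
from typing import Dict


class Resolution(Enum):
    AUTO_RESOLVE = "expected, automated fix"
    MANUAL_KNOWN = "expected, needs human review"
    MANUAL_UNKNOWN = "unexpected, needs human review"


def reconcile(internal: Dict[str, str], settlement: Dict[str, str]) -> Dict[str, Resolution]:
    """Compare our payment statuses against a PSP settlement file.

    Both arguments map payment ID to status. Matching payments are omitted
    from the result; every mismatch is assigned one of the three categories.
    """
    findings: Dict[str, Resolution] = {}
    for payment_id in sorted(set(internal) | set(settlement)):
        ours = internal.get(payment_id)
        theirs = settlement.get(payment_id)
        if ours == theirs:
            continue
        if ours == "PENDING" and theirs in ("SUCCESS", "FAILED"):
            # We missed the PSP's final answer: safe to adopt it automatically.
            findings[payment_id] = Resolution.AUTO_RESOLVE
        elif ours is None or theirs is None:
            # One side has no record of the payment at all.
            findings[payment_id] = Resolution.MANUAL_KNOWN
        else:
            # Two conflicting final states.
            findings[payment_id] = Resolution.MANUAL_UNKNOWN
    return findings


internal_state = {"p1": "SUCCESS", "p2": "PENDING", "p3": "SUCCESS", "p4": "FAILED"}
psp_settlement = {"p1": "SUCCESS", "p2": "SUCCESS", "p4": "SUCCESS", "p5": "SUCCESS"}
findings = reconcile(internal_state, psp_settlement)
```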
We discuss more specific failure scenarios and edge-cases below in the low-level-design.
Low Level Design
Persistence:
The persistence layer for the Payment Platform Service requires extremely strong durability and consistency guarantees, at the expense of some availability and performance. We will require ACID properties and support for transactional locking. We will need to store information on each payment request and the status of payment processing at each step. Given these requirements, for each microservice we will go with a relational MySQL database hosted on AWS Aurora, replicated to at least one backup AWS region for disaster recovery.
In this exercise, we will detail two of the key SQL tables required. The first table (“Payment Request Table”) tracks payment request metadata including fields for payment UUID, user UUID, from and to account UUIDs, and payment option. The second table (“Payment Order Status Table”) tracks payment state, for example if the request to the PSP was successful or if the Ledger and Wallets were updated. This table includes fields for payment request UUID, checkout UUID, amount, currency, PSP status, Ledger status, Wallet status, and aggregate workflow status.
See appendix for table schema details.
Performance and Scaling:
Finally, we will discuss performance and scaling our solution. Each of the services is considered stateless, given that persisted source data is stored externally within a database and there are no long-lived connections which need to be drained. Stateless services are desirable because the compute layer can easily scale horizontally using auto-scaling policies that monitor traffic volume, CPU utilization, memory utilization, and disk space.
Scaling the persistence layer itself is a bit more complex, as this solution requires strong consistency where any replicated data is kept consistent throughout. We cannot simply use relational database read-only replicas with asynchronous replication, as this would leave the replicas with slightly outdated data compared to the primary write nodes. There are several ways to solve this strongly-consistent database scaling dilemma. The simplest solution is to have the database not return a success to the backend service until the data is replicated consistently across all replicas. Replicated database nodes can optionally be split between read and write nodes, but all nodes need to be consistent at any given time. This obviously impacts performance and scalability, as every database node now needs to be updated in real time. A more scalable consistent solution is to shard the database, for example on user-id or payment-id. Rather than replicating all data across all database nodes, this lets us store a given dataset on a subset of database nodes and only wait for that subset of nodes to reach a consistent state. This is much more scalable and performant. A third option is to replicate through a distributed consensus protocol such as Paxos or Raft, where a write is committed once a majority of replicas has acknowledged it rather than all of them. Note that these protocols still elect a leader, so they are not leaderless. AWS DynamoDB, for example, uses Paxos for leader election within each partition's replication group.
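The sharding option can be illustrated with a toy in-memory store in which a write touches only the replicas of one shard and returns once all of them have applied it. Everything here is a stand-in: dictionaries play the role of database nodes, and the CRC32-based routing is just one simple way to pick a shard.

```python
import zlib
from typing import Dict, List, Optional, Tuple


class ShardedStore:
    """Toy store: data is split across shards, each shard fully replicated."""

    def __init__(self, shard_count: int, replicas_per_shard: int) -> None:
        self._shards: List[List[Dict[Tuple[str, str], str]]] = [
            [{} for _ in range(replicas_per_shard)] for _ in range(shard_count)
        ]

    def shard_index(self, user_id: str) -> int:
        """Route a user to a shard with a stable hash of the user ID."""
        return zlib.crc32(user_id.encode("utf-8")) % len(self._shards)

    def write(self, user_id: str, key: str, value: str) -> int:
        """Apply the write to every replica of the owning shard only.

        Returns the number of replicas that applied it; the caller treats the
        write as successful once this equals the shard's replica count.
        """
        replicas = self._shards[self.shard_index(user_id)]
        for replica in replicas:
            replica[(user_id, key)] = value
        return len(replicas)

    def read(self, user_id: str, key: str, replica: int = 0) -> Optional[str]:
        """Read from any replica of the owning shard."""
        return self._shards[self.shard_index(user_id)][replica].get((user_id, key))


store = ShardedStore(shard_count=4, replicas_per_shard=3)
acks = store.write("user-42", "balance", "10.00")
```

The write waits on three replicas instead of all twelve nodes, which is the scalability gain described above.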
Edge Cases:
Example failure scenarios and how they are handled:
- User clicks "pay" button multiple times
- Each request from the frontend contains a payment UUID. When the Payment Platform Service receives the request, it validates that it has not already processed that payment UUID. If already received, a 429 error code is returned as this request is a duplicate.
- Payment Platform Service writes request to event log then suffers internal failure
- 429 error code is returned as this request is a duplicate, detected via payment UUID.
- PSP response to Payment Processing Service is dropped, a duplicate payment request is sent to the PSP
- Each request to the Payment Processing Service contains a payment order UUID mapped by the Payment Platform Service from the payment UUID. When the Payment Processing Service receives the request it detects a duplicate by checking its internal database using the payment order UUID.
- PSP repeatedly fails across multiple requests
- Payment Processing Service opens the circuit-breaker for the given third-party PSP and retries using an alternative PSP
- Payment Processing Service response to Payment Platform Service is dropped, a duplicate payment request is submitted a second time to the Payment Platform Service
- 429 error code is returned as this request is a duplicate, detected via payment UUID. Separately, retry logic in the Payment Platform Service retries the request to the Payment Processing Service and receives the dropped prior response.
- Payment Platform Service fails to write to Ledger
- Request to the Ledger is retried; on repeated failure the workflow is reversed to revert the prior payment
- Payment Platform Service writes to the Ledger and then fails to write to Wallet
- Request to the Wallet is retried; on repeated failure the workflow is reversed to revert the prior payment and Ledger entry
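The "retry, then reverse" handling in the last two scenarios can be sketched generically. `TransientError`, `run_step`, and the attempt limit are hypothetical names and values for this example; the `reverse` callback stands for whatever compensating action undoes the earlier steps of the workflow.

```python
from typing import Callable


class TransientError(Exception):
    """Hypothetical error type for a failure that is worth retrying."""


def run_step(step: Callable[[], None], reverse: Callable[[], None],
             max_attempts: int = 3) -> str:
    """Try a workflow step a few times; if it keeps failing, run the reversal."""
    for _ in range(max_attempts):
        try:
            step()
            return "COMPLETED"
        except TransientError:
            continue  # a real service would wait with backoff before retrying
    reverse()
    return "REVERSED"


def failing_write() -> None:
    raise TransientError("store unavailable")


reversals = []
outcome = run_step(failing_write, lambda: reversals.append("refund issued"))
```

Since the Ledger is immutable, the reversal would be recorded as new offsetting entries rather than by deleting what was written.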
Security:
With any software solution we should always carefully consider security and privacy. Given this is a financial service, security is even more important. Let’s discuss various security areas of concern and how we are addressing them.
- Fraudulent transactions and compromised credentials
- Credentials required by PSP and two-factor-auth enabled
- Address verification, card verification, risk engine models
- Man-in-the-Middle attack - OWASP link
- HTTPS encrypted communications
- TLS with certificate pinning, a minimum protocol version, and a certificate revocation list (CRL)
- JSON Web Tokens (JWT) used internally between services
- Distributed-Denial-of-Service (DDoS) attack - OWASP link
- Web application firewall (WAF) with rate-limiting, block-list, and multiple points of presence (PoPs)
- SQL Injection Attack - OWASP link
- Sanitizing all user-entered data prior to processing or persisting it using prepared statements with parameterized queries
- Cross-Site-Scripting (XSS) - OWASP link
- This is less of a risk in the MVP application described but could become relevant quickly. To prevent XSS, all user-entered data (e.g. HTML, CSS, JS, URLs) is sanitized using an allow-list prior to processing or persisting it.
- Cross-Site-Request-Forgery (CSRF) - OWASP link
- This is less of a risk in the MVP application described but could become relevant quickly. To prevent CSRF, a server-side per-request token is generated and sent with the webpage in the initial request, never as a cookie. Only subsequent API requests containing that CSRF token are accepted.
- Data Deletion or Loss
- Data is replicated across availability-zones, cloud regions, and logged to a secured append-only event log and ledger which cannot be modified or deleted by users or employees.
- Principle of least privilege, granting the minimal permissions to each role
- Compliance
- By leveraging PSPs for all payment processing, compliance requirements such as PCI compliance are reduced in scope. Where needed we use AWS PCI-compliant tools.
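As an illustration of the parameterized queries mentioned under SQL injection, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for the MySQL database. The table is a cut-down, hypothetical version of the Payment Request Table, and placeholder syntax differs between database drivers.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment_request (payment_id TEXT PRIMARY KEY, user_id TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO payment_request (payment_id, user_id) VALUES (?, ?)",
    ("p-1", "u-1"),
)


def find_payments(connection: sqlite3.Connection, user_id: str) -> List[str]:
    """Look up payment IDs for a user with a parameterized query.

    The user_id is passed separately from the SQL text, so the driver treats
    it purely as data and never as part of the statement.
    """
    rows = connection.execute(
        "SELECT payment_id FROM payment_request WHERE user_id = ?", (user_id,)
    )
    return [row[0] for row in rows]


normal_result = find_payments(conn, "u-1")

# A classic injection attempt is matched literally and finds nothing.
hostile_result = find_payments(conn, "u-1' OR '1'='1")
```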
Conclusion
We have completed our high-level system architecture for building an end-to-end payments solution covering requirements, high level design, individual customer and data flows, persistence, security, and edge cases. I hope this was insightful - keep an eye out for the next system architecture post.
Appendix - Database Table Schemas
Payment Request Table:
- payment_id: string (primary-key)
- user_id: string
- payment_option: string
- is_complete: boolean
Payment Order Status Table:
- payment_order_id (primary-key)
- payment_id (foreign-key)
- from_acct: string
- to_acct: string
- amount: string
- currency: string
- payment_order_status: string
- ledger_updated: boolean
- wallet_updated: boolean