This week I'm acting as a Data Engineer. Why? Radix is managing a massive PayPal merchant with over 600K active subscriptions, adding roughly 15K new subscription trials per month. That base generates a large volume of secondary data every hour: failed and successful payments, refunds, and, most importantly, subscription status changes (Active, Past Due, or Cancelled).
Our current data pipeline handles 95% of our customers perfectly fine, but a few customers have data volumes that demand something more powerful. Radix's bread and butter is data accuracy, so ingestion, processing, and storage have to be rock solid; otherwise we can't help these merchants track and analyze their revenue metrics.
The catch: as a small startup, we need to spend our resources carefully to grow our margin per customer without compromising quality. A well-implemented data pipeline definitely helps.
🔗 Data Fetching → Data Transport → Data Processing
1) AWS Fargate: Running a Python script that fetches data from the PayPal API in batches (see the fetch sketch after this list).
2) SQS: Queuing messages and buffering data until it's ready for processing, which is perfect for batch work (see the SQS sketch after this list).
- Note: While Kafka is ideal for real-time data streaming, our PayPal APIs update every 4 hours, so it’s not necessary for our current needs. Plus, Kafka can be costly!
3) MongoDB vs. S3 for raw storage: MongoDB can hold the raw batches, but with gigabytes of data per fetch it gets expensive. For speed and cost, AWS S3 is the better home for raw data.
- Note: Land the raw batches in S3, process them from there, and store only the processed results in MongoDB. That combination saves a lot of money (see the raw-storage sketch after this list).
4) Apache Spark: Our secret sauce! With the right architecture it chews through terabytes of data, in large batches or near real time via Structured Streaming. This is where we compute the vital revenue metrics: MRR, churn, LTV, and subscription status counts (see the Spark sketch after this list).
- Note: If real-time metrics are needed, process the data directly from the transport layer (SQS or Kafka), with Kafka being the better fit for streaming. Always look for ways to reduce compute and save costs!
5) Saving the processed data, which is the real bread and butter. You have several options:
- Once Apache Spark has processed the data, write it straight back to MongoDB in a new collection (shown at the end of the Spark sketch below).
- Alternatively, save the processed data to a separate database or data lake. In my case, the processed metrics go to MongoDB while the raw batches stay in S3, as noted in step 3.
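To make the steps concrete, here are a few rough Python sketches. First, the Fargate job from step 1. The endpoint path, parameters, and response fields follow PayPal's Transaction Search REST API, but treat them as assumptions and check the current docs; the point is paged, batch-style fetching.

```python
"""Step 1 (sketch): batch-fetch PayPal data from a Fargate task."""
import os
import requests

PAYPAL_BASE = "https://api-m.paypal.com"  # live environment (assumption)


def get_access_token() -> str:
    # OAuth2 client-credentials flow; credentials are injected as Fargate secrets.
    resp = requests.post(
        f"{PAYPAL_BASE}/v1/oauth2/token",
        auth=(os.environ["PAYPAL_CLIENT_ID"], os.environ["PAYPAL_SECRET"]),
        data={"grant_type": "client_credentials"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]


def fetch_transactions(start_date: str, end_date: str, page_size: int = 500):
    """Yield one page (batch) of transactions at a time."""
    token, page = get_access_token(), 1
    while True:
        resp = requests.get(
            f"{PAYPAL_BASE}/v1/reporting/transactions",  # check PayPal docs for the exact path/fields
            headers={"Authorization": f"Bearer {token}"},
            params={"start_date": start_date, "end_date": end_date,
                    "page": page, "page_size": page_size},
            timeout=60,
        )
        resp.raise_for_status()
        body = resp.json()
        yield body.get("transaction_details", [])
        if page >= body.get("total_pages", 1):
            break
        page += 1
```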
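Step 2, queuing on SQS with boto3. The queue URL and message shape are placeholders I made up. One real constraint worth knowing: SQS messages max out at 256 KB, so for big batches you usually enqueue a pointer to the data (e.g. an S3 key) rather than the records themselves, which ties in nicely with step 3.

```python
"""Step 2 (sketch): hand each fetched batch to SQS."""
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/paypal-raw-batches"  # placeholder


def enqueue_batch(batch_pointer: dict) -> None:
    # SQS caps messages at 256 KB, so we send a small pointer such as
    # {"s3_key": "paypal/dt=2024-05-01/batch-120000.json.gz", "record_count": 500}
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(batch_pointer))
```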
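Step 3, landing the raw batches in S3. The bucket name and key layout are made up; the idea is cheap, append-only raw storage, partitioned by date so Spark only reads what it needs.

```python
"""Step 3 (sketch): write a raw batch to S3, date-partitioned and gzipped."""
import gzip
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "radix-paypal-raw"  # placeholder bucket name


def store_raw_batch(records: list) -> str:
    now = datetime.now(timezone.utc)
    key = f"paypal/dt={now:%Y-%m-%d}/batch-{now:%H%M%S}.json.gz"
    # One JSON record per line, compressed, so Spark can read it directly.
    body = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))
    s3.put_object(Bucket=BUCKET, Key=key, Body=body)
    return key  # this key is what gets enqueued on SQS
```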
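Finally, steps 4 and 5: aggregating with PySpark and saving the results. The column names (subscription_id, status, plan_amount) are assumptions about the raw schema, and the write block assumes the MongoDB Spark Connector (10.x option names) plus s3a support are configured on the cluster; adjust for your own setup.

```python
"""Steps 4-5 (sketch): compute revenue metrics with PySpark, save to MongoDB."""
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("radix-paypal-metrics")
    .config("spark.mongodb.write.connection.uri", "mongodb://user:pass@host/radix")  # placeholder
    .getOrCreate()
)

# Raw subscription snapshots landed by the ingestion steps above.
subs = spark.read.json("s3a://radix-paypal-raw/paypal/dt=*/")

# One row per status with subscription counts and summed recurring amounts;
# MRR comes from the ACTIVE rows, churn falls out of the CANCELLED rows.
metrics = subs.groupBy("status").agg(
    F.count("subscription_id").alias("subscriptions"),
    F.sum("plan_amount").alias("recurring_amount"),
)

# Step 5: write the processed metrics back to MongoDB in their own collection.
(
    metrics.write.format("mongodb")
    .option("database", "radix")
    .option("collection", "revenue_metrics")
    .mode("append")
    .save()
)
```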
#DataEngineering #AWS #MongoDB #ApacheSpark #PayPal #DataAnalytics #Innovation #TechForGood