Openstatus www.openstatus.dev

📝 infra blog post (#1127)

* 📝 first draft blog

* 📝 article update

* 📝 blog

* 📝 improve blog post

* 📝blog post improvment

* 📝blog post improvment

authored by

Thibault Le Ouay and committed by
GitHub
99d3a10c 917979bb

+117 -1
+4 -1
apps/web/.env.example
···
 WORKSPACES_LOOKBACK_30=

 # https://turbo.build/repo/docs/crafting-your-repository/using-environment-variables#loose-mode
-TURBO_ENV_MODE=loose
+TURBO_ENV_MODE=loose
+
+OPENPANEL_CLIENT_ID=something
+OPENPANEL_CLIENT_SECRET=something
apps/web/public/assets/posts/infra-openstatus/hosting.png

This is a binary file and will not be displayed.

apps/web/public/assets/posts/infra-openstatus/queue.png

This is a binary file and will not be displayed.

apps/web/public/assets/posts/infra-openstatus/tech-infra.png

This is a binary file and will not be displayed.

+113
apps/web/src/content/posts/openstatus-infra.mdx
---
title: "Building OpenStatus: A Deep Dive into Our Infrastructure Architecture"
description:
  Let's dive deep into the infrastructure behind OpenStatus. Learn how we built it and how we host it.
author:
  name: Thibault Le Ouay Ducasse
  url: https://bsky.app/profile/thibaultleouay.dev
publishedAt: 2024-12-29
tag: engineering
image: /assets/posts/infra-openstatus/tech-infra.png
---

## Infrastructure Overview

OpenStatus is a synthetic monitoring platform designed with resilience, scalability, and efficiency in mind.
Our users rely on us to provide real-time insights into their service health, making it essential to maintain a robust and performant infrastructure.

In this post, we'll take a deep dive into our infrastructure architecture, exploring the key components, managed services, and design principles that power OpenStatus.

## Application Landscape

Our platform consists of several interconnected applications, each designed for a specific purpose:

1. **Frontend Ecosystem**:
   - A Next.js application that powers our marketing site, user dashboard, and status page, hosted on [Vercel](https://vercel.com/).
   - An Astro + Starlight-powered documentation application hosted on [Cloudflare Pages](https://pages.cloudflare.com/).

   We chose Vercel for the Next.js application because it performs exceptionally well there and the DX is great. We selected Cloudflare Pages for the documentation because it is a static site and it's very cheap.

2. **Backend Infrastructure**: All our backend services are hosted on Fly.io.
   - API server: our public API and our alerting engine.
   - Probes/Checker: a Go app deployed globally to monitor your services.
   - Screenshot app: a service that takes a screenshot of your website when we detect downtime (using Playwright).
   - Workflow engine: a server that handles the alerting workflow and our internal workflows (email automation).

   We chose Fly.io for our backend services because it's a great platform for deploying globally distributed services. It's also very easy to deploy and manage.
   We are planning to add more providers (e.g. Koyeb) for our probes to make the system more resilient.

<Image
  alt="Hosting providers"
  src="/assets/posts/infra-openstatus/hosting.png"
  width={650}
  height={575}
/>


## Managed Services

We also rely heavily on managed services so that we don't have to run this infrastructure ourselves. Here are the services we use:

### Scheduling

Given the critical nature of monitoring, we rely heavily on cron to ensure timely checks:

- **Cron Jobs**: Currently using Vercel Cron, with plans to migrate to Google Cron for a better user experience (e.g. a better UI where we can see when the cron ran, and a retry policy).


### Queue Architecture

Because checks are critical, we use a queue to handle task processing and retry logic.

Every check is pushed to a queue and processed by our probes. If a probe fails to process the check, it is retried 3 times before being marked as failed.

- **Job Queue**: Google Task Queues provide our distributed task management, with queues segmented by check frequency.

We've implemented a granular queue system to ensure efficient task processing: each queue is dedicated to a specific check frequency (e.g. every minute, every 10 minutes).
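
The routing and retry behaviour described above can be illustrated with a small in-process sketch. This is not the OpenStatus implementation and it does not call any Google Cloud API: the queue names, the `Check` type, and the `probe` callable are all hypothetical, and "retried 3 times" is read here as one initial attempt followed by up to three retries.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical queue names, one per check frequency (period in seconds).
QUEUE_BY_PERIOD = {60: "checks-1m", 600: "checks-10m"}

# From the post: a check is retried 3 times before being marked as failed.
MAX_RETRIES = 3


@dataclass
class Check:
    url: str
    period_seconds: int


def queue_for(check: Check) -> str:
    """Return the queue dedicated to this check's frequency."""
    return QUEUE_BY_PERIOD[check.period_seconds]


def process(check: Check, probe: Callable[[Check], bool]) -> Tuple[str, int]:
    """Run the probe once, then retry up to MAX_RETRIES times on failure.

    Returns (status, attempts), where status is "ok" or "failed".
    """
    attempts = 0
    for _ in range(1 + MAX_RETRIES):  # assumption: 1 initial try + 3 retries
        attempts += 1
        if probe(check):
            return "ok", attempts
    return "failed", attempts


# Usage: a probe that fails twice and then succeeds.
check = Check("https://example.com", 60)
outcomes = iter([False, False, True])
print(queue_for(check))                            # checks-1m
print(process(check, lambda c: next(outcomes)))    # ('ok', 3)
```

In a managed queue the retry count would normally be part of the queue's configuration rather than application code. The loop above only makes the "three retries, then failed" rule explicit.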

<Image
  alt="Queue providers"
  src="/assets/posts/infra-openstatus/queue.png"
  width={650}
  height={575}
/>


### Data Infrastructure

We don't want to run the data infrastructure ourselves either, so we rely on managed services for that too:

- **Primary Database**: [Turso](https://turso.tech?ref=openstatus.dev), a cost-efficient data storage solution. We love that it's a hosted SQLite database: it's just a file we can embed in our services and sync periodically.
- **Analytics Database**: [Tinybird](https://www.tinybird.co?ref=openstatus.dev), enabling complex analytical queries and insights.

## Design Philosophy

Our infrastructure design is driven by several key principles:

- **Resilience**: Ensuring high availability and fault tolerance.
- **Scalability**: Architectural choices that allow seamless growth.
- **Cost-Efficiency**: Leveraging managed services and cloud credits.
- **Performance**: Optimizing each component for maximum efficiency.


## How much does it cost us?

Our current monthly cost is around $328. This includes:

- Vercel: $40. There are two of us on the team, so we had to upgrade to the team plan.
- Fly.io: $154, i.e. 36 × $4 (all our probes, at $4 on average since not all regions cost the same) + $10 (for the API server).
- Google Cloud Platform: $0. We are still using the free credits, but we expect to pay around $50 for the queue.
- Tinybird: $100
- Turso: $29
- Cloudflare: $5



## Conclusion

Building a resilient synthetic monitoring platform is hard. It's not just a $5 VPS that you can deploy and forget. It requires a more complex infrastructure to provide a reliable service.

The drawback of this approach is that it is hard to offer an easily self-hostable service. That is frustrating because we are an open-source project and we want to provide a self-hostable version of OpenStatus. We are working on a community edition that will be easier to deploy.

*Want to start monitoring your services with OpenStatus? [Sign up for free](https://www.openstatus.dev/app/login?ref=blogpost-infra) and get started today!*