247

Stack Overflow has existed since 2008. At that time, it was a logical choice to run this from our own hardware in a datacenter. The company started with a small group of people who owned every aspect of the application, from the infrastructure to the code. We built a monolithic application that scaled incredibly well and we squeezed the most out of the hardware we had.

It was a great time.

But now, the time has come to leave our data center and move to the cloud.

Why are we moving to the cloud?

The tight control we had over our hardware really helped us build a scalable application that was cost-efficient. However, the decision to move to the cloud was instigated by three unrelated events:

Our data center in New York recently announced that it’s going to close. We would need to move all our hardware to a new location. Moving would be expensive, but most importantly, it would be an all-consuming project for 4-5 Site Reliability Engineers (SREs) for many months. That time could be better spent moving to the cloud.

Additionally, our hardware is reaching its end-of-support date and would need to be refreshed. This would be very expensive. Should we spend that money on continuing our old direction or use this as an opportunity to explore something new?

Lastly, maintaining our data centers was becoming a distraction. We estimate it would take 2-4 full-time people to properly maintain all our hardware. Sometimes hardware breaks, and this means our engineers have to physically go to the data center to fix things. Owning the hardware also means we have to do all the maintenance ourselves and sometimes upgrade the hardware when bits and pieces go out of support. This takes time and money from other things we want to do.

We’re already using the cloud for our Stack Overflow for Teams and Stack Overflow Enterprise products. Stack Overflow for Teams originally ran in the data center too, but a little over a year ago, we split Teams from stackoverflow.com and moved it to Microsoft Azure (you can read more about this journey here and here). Teams has since then moved from virtual machines to Kubernetes, and we started deploying independent microservices that are no longer part of the original monolith. Stack Overflow Enterprise has been running in Azure for a long time, giving each customer its own isolated infrastructure and scalability, and it, too, is moving to Kubernetes.

We love the cloud. People don’t have to go into the data center anymore (which is really handy given that we are a fully remote, international company). We also gain a lot of flexibility with our cloud usage. In the data center, we were constrained by how much hardware we wanted to buy and maintain. The cloud gives us the flexibility to use hardware when we need it and stop paying for it once we’re done. We were already doing CI/CD on-premises, but in the cloud, we now also have Infrastructure as Code, making it much easier to manage all our cloud resources. Add the Docker and Kubernetes foundation we’ve built for Teams, and we are in a pretty good place with our cloud usage.

Now we’re going to take ‘everything else’ to the cloud.

Everything else includes the full Stack Exchange network: stackoverflow.com, all Stack Exchange sites, all meta sites, all apps from Area51 and Chat to internal apps like StackMail (our email sending service), and Scheduler (which runs scheduled tasks, such as badge awards). This is a massive project that we’ve been working on for a while, and we want to finish by June 2025.

To be explicit, however: we are not going to the cloud to save money. We know that the cloud is often more expensive than running your own hardware. We are of course monitoring our costs, and we keep them to a minimum by right-sizing our cloud usage, using tools such as Azure Reserved Instances, and autoscaling our resources based on load. However, the cost is worth the new flexibility. Instead of projects being delayed while new hardware is procured, installed, and configured, we can spin up new capacity in minutes. We can even set up capacity experimentally and throw it away afterwards. This should help us get new features out of the door faster.
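
As a rough illustration of why scaling with load matters for cost, here is a minimal Python sketch that compares capacity sized once for the busiest hour with capacity resized every hour. Everything in it is an assumption made up for the example: the hourly load profile, the capacity of one node, and the function names. None of it reflects our real traffic or pricing.

```python
import math


def nodes_needed(load_rps, rps_per_node):
    """Smallest whole number of nodes that can serve the load (at least 1)."""
    return max(1, math.ceil(load_rps / rps_per_node))


def fixed_node_hours(hourly_load, rps_per_node):
    """Node-hours when capacity is sized once, for the busiest hour."""
    peak_nodes = nodes_needed(max(hourly_load), rps_per_node)
    return peak_nodes * len(hourly_load)


def autoscaled_node_hours(hourly_load, rps_per_node):
    """Node-hours when capacity is resized every hour to match the load."""
    return sum(nodes_needed(load, rps_per_node) for load in hourly_load)


# Hypothetical day in requests per second: quiet night, busy working hours.
HOURLY_LOAD = [200] * 8 + [1000] * 10 + [400] * 6
RPS_PER_NODE = 250  # assumed capacity of a single node

fixed = fixed_node_hours(HOURLY_LOAD, RPS_PER_NODE)
scaled = autoscaled_node_hours(HOURLY_LOAD, RPS_PER_NODE)
print(f"fixed: {fixed} node-hours, autoscaled: {scaled} node-hours")
```

The gap between the two numbers depends entirely on how spiky the load is. A flat load profile would make them nearly equal, and a higher price per cloud node-hour can still outweigh the savings.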

Where are we going?

As discussed, our Teams products run in Microsoft Azure. For the public platform, we’ve decided to move to Google Cloud. We have a tight partnership with Google and we think they are the right cloud provider for the public platform.

Who is working on the migration to the cloud?

The internal name for moving our public platform to Google Cloud Platform (GCP) is Project Ascension. The Ascension team consists of several developers, site reliability engineers, our database team, and a community manager.

As you’ve probably guessed by now, I’m one of the developers working on this project. I’ve worked on the migration of Teams to Azure and it was a logical choice to bring that knowledge to Ascension.

One of the other developers is Steve Vakil. He came to Stack Overflow in 2022 after many years of software development work for government public assistance programs, is intimately familiar with our ads product, and is a Google Certified Cloud Engineer. The other two developers are balpha and Adam Lear who both have a rich history at Stack Overflow and know everything there is to know about our products and all the things we have to think about with such a massive project.

On the SRE side, we’re joined by Mike Frank, Jason Schwanz and Tom Limoncelli. They’re designing and building our cloud infrastructure which is no small feat considering the complexity of applications that are built with decades of assumptions about the environment they run in.

And, on the DBRE side, Aaron Bertrand, who was at the helm for migrating the databases supporting Stack Overflow for Teams to Azure, will be carving out the database infrastructure in GCP and planning and performing the migration of the entire network of databases out of our on-premises data centers.

What have we done so far?

At the end of 2023 we started planning Ascension and thinking about all the things we have to do. Early this year, the team slowly came together and we started the real work.

We have now built out the core of our infrastructure in GCP, such as Cloudflare support, networking, Kubernetes clusters, database virtual machines, security and policies. This is all built on Terraform and automated CI/CD pipelines. We then started deploying all the applications that are required to run stackoverflow.com into Kubernetes in our test environment.

Our biggest unknown was how our application would perform in a cloud environment. Our app was built within the constraints of the data center, where we knew exactly how much latency we had and what we could squeeze out of our servers. Could we make our application run in a cloud environment or do we need to make big architectural changes?

We are happy to say that we did make it work. We built a set of load tests using k6 and we slowly built up the load we can support by tweaking our infrastructure - from the right Kubernetes node sizes to nginx settings and SQL Server VMs. We now think we’ve learned everything we can from these synthetic load tests and we’re ready to move on to the next step.
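
To give a feel for what "slowly building up the load" can look like, here is a minimal Python sketch of a staged ramp: each stage has a duration and a target number of simulated users, and the target moves linearly between stages. This only illustrates the ramping idea. It is not one of our actual test scripts (k6 scripts are written in JavaScript), and the function name and stage values are invented for the example.

```python
def target_users(stages, elapsed_seconds, start_users=0):
    """Return the target number of simulated users at a point in time.

    `stages` is a list of (duration_seconds, target_users) pairs. Within a
    stage the target moves linearly from the previous stage's target to this
    stage's target. After the last stage, the final target is held.
    """
    previous = start_users
    stage_start = 0
    for duration, target in stages:
        stage_end = stage_start + duration
        if elapsed_seconds < stage_end:
            progress = (elapsed_seconds - stage_start) / duration
            return round(previous + (target - previous) * progress)
        previous = target
        stage_start = stage_end
    return previous


# Hypothetical ramp: warm up, hold, push higher, hold, then ramp down.
RAMP = [(60, 100), (120, 100), (60, 500), (120, 500), (60, 0)]

for second in (0, 30, 60, 200, 240, 390, 420):
    print(second, target_users(RAMP, second))
```

Holding each plateau for a while before pushing higher makes it easier to tie a change in behavior to a specific load level and to one infrastructure tweak at a time.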

What will be the next steps?

One step is already done: Stack Snippets is running in GCP without any issues. Stack Snippets is a relatively simple app that does not need a database or Redis. It helped us prove out that our infrastructure is ready. Using Stack Snippets we showed that we have the monitoring and observability in place to run applications in production in GCP.

Our next step is moving more complex applications, one by one, to GCP. We want to avoid doing a “big bang” where we migrate everything at once, since that introduces too many unknowns and too much risk. Over the coming months we will announce which applications we’re moving and move these one by one to GCP. Some of these announcements have taken place on Meta in separate posts, but we plan to consolidate them as an update on this post.

Once we have our internal applications all running in GCP, we plan on moving Stack Exchange sites to GCP. Our current plan is to start with some smaller sites that don’t take a lot of traffic so that we can do basic testing. After that, we will likely move meta.stackexchange.com to GCP so we can test performance with more traffic.

Once we’re confident that our GCP environment is stable and can sustain our traffic, we will move our big sites, and we will end the migration with stackoverflow.com. While we'll do our best to minimize downtime, big changes like these frequently bring short periods of instability. We apologize in advance and appreciate your understanding.

Now, of course, this plan is very much in flux. We will learn more and more as we execute on it, things will change, and we’ll try our best to keep you up to date.

I personally am really happy that we’re moving to the cloud. Having flexible infrastructure that’s deployed and managed automatically will give us a lot of opportunities to improve the way we work and build new features.

We are just at the beginning so we will keep you posted on what’s going on.

41
  • 64
    Does this choice come with any form of "privileged" data access for Google? Seems like not only they will be able to increase their hold on the CC dataset (and probably adding some more hops for competitors) but also be in a far more favorable position to profile people. Commented Nov 13, 2024 at 8:52
  • 136
    @ꓢPArcheon Be nice, no conspiracy theories before Wouter and I have had our first coffee 😄 Our servers used to live in a datacenter where Google owned the whole building IIRC (I don't remember the details 100%, so don't quote me on that), and to my knowledge they didn't syphon off our souls there either.
    – balpha StaffMod
    Commented Nov 13, 2024 at 9:14
  • 4
    And what will be user impact of this? Just servers external IP addresses changing, or something more?
    – SmallSoft
    Commented Nov 13, 2024 at 9:57
  • 18
    @MSDN.WhiteKnight IP addresses are set by Cloudflare and will stay the same. Functionally, the apps should also stay the same. There could be performance changes but we're doing our best to keep the app performing as good as it is today.
    – Wouter de Kort Staff
    Commented Nov 13, 2024 at 11:14
  • 25
    I came up with the name Ascension.... we are ascending to the cloud.
    – Tom Limoncelli StaffMod
    Commented Nov 13, 2024 at 13:24
  • 28
    Wow, I mean the trend is currently back away from the cloud given high costs
    – LinusG.
    Commented Nov 13, 2024 at 16:17
  • 30
    @KennethKho This is the first time I've seen someone describing a "move to cloud" where the reasons are even vaguely justifiable. So long as they don't go all-in on proprietary microservice meta-APIs, they might even spend less time cloud-wrangling than the server-wrangling was taking them!
    – wizzwizz4
    Commented Nov 13, 2024 at 17:12
  • 9
    @user1271772 At this scale and with these performance and availability requirements? Of course! I would say that 4 engineers and a few months is actually extremely efficient. Commented Nov 14, 2024 at 10:52
  • 8
    @user1271772 I've worked on a migration of this sort before. I'd say it was a 'more' critical piece of infrastructure but fewer machines, and less monolithic, and took us over a year. And we didn't need to find physical space (and our options for hardware were relatively limited)... took over a year. Halfassing a move is easy. Getting it right takes time and effort, even if you don't scotty it. Commented Nov 14, 2024 at 12:19
  • 5
    @MikeRichardson yes... you have to think of everything and also have plans for many eventualities. It's not as easy as "unplug in the old place and plug in at the new place". You have to install the new system in the new place, move over all the data (which is not easy on a system that is updating its data permanently)... It's a lot of work of planning, testing, replanning, changing, planning again... and at one time pull the plug!
    – kruemi
    Commented Nov 14, 2024 at 14:20
  • 67
    +1 not because I have any informed opinion on cloud- vs. self-hosting, but because the engineering side's long-standing commitment to informing and educating the userbase is admirable. Thank you for sharing, explaining, and listening.
    – nitsua60
    Commented Nov 14, 2024 at 14:59
  • 5
    @user1271772 Not if you are manic (and probably accept an order or two of magnitude more outages/glitches). Commented Nov 15, 2024 at 11:16
  • 9
    It's been 3 days and nobody has mentioned the hamsters! What will happen to the hamsters now that the servers are being shut down? Will they be unemployed? Commented Nov 16, 2024 at 17:32
  • 16
    @BhargavRao Thank you for asking! We will, of course, support them during this transition and find another loving company to take them in.
    – Adam Lear StaffMod
    Commented Nov 17, 2024 at 4:25
  • 5
    @JamesGeddes yes :) For our scenario, we don't agree.
    – Wouter de Kort Staff
    Commented Nov 20, 2024 at 8:15

15 Answers

20

Rather than create a ton of individual posts for individual bits and pieces of the migration, I thought it would be helpful to have a single aggregator answer. Where better to put it than directly on the announcement post?

Fields in this table are:

  • Component: The name of the piece of our infrastructure we are migrating.
  • Description: What that component actually does on the network.
  • Downtime risk: How likely it is for there to be downtime, and what impact that downtime would have, if any.
  • Migration date: Our best estimate of the date on which the migration will be performed.

Dates in this table should be regarded as best estimates. We'll do our best to keep this table up to date. Only dates listed in bold are firm dates, meaning we feel the date is unlikely to change. Firm dates may still move as required to meet internal needs. I'm also listing entries as "not yet scheduled" even when I could probably make a reasonable guess, because I'd rather tell you what I know than what I simply assume is true.

Particularly large, risky, or interesting parts of this migration will probably still get their own posts. Most components of the Stack Exchange network have an "application" and a "database" component that can be migrated separately. The application supplies both the back end and front end logic to the network, and the database houses the data that application utilizes. Not all applications have a corresponding database.

The following table lists our current estimated migration dates:

Component | Description | Downtime risk | App migration date | Database migration date
StackSnippets | Renders StackSnippets on Stack Overflow. | | Already done! | Not applicable
StackAuth | Stack Exchange's login and authentication service. | | Already done! | Not applicable
Chat (Bonfire) (some prep work discussed here) | Bonfire runs all of the chatrooms on Stack Exchange (chat.meta.stackexchange.com, e.g.). The application hosts our Chat sites around the network. | Chat should only go down if the migration is not successful; in this event, chat will likely be unavailable network-wide until the issue is repaired. | Completed on December 3rd, 2024 | No earlier than January 2025
stackexchange.com | stackexchange.com serves stackexchange.com. Intuitive, right? | If the migration is unsuccessful, stackexchange.com would be inaccessible. | No earlier than January 2025 | No earlier than January 2025
Area51 | Area51 refers to area51.stackexchange.com, the proving ground for new site proposals. | If Area51 goes down, area51.stackexchange.com may be inaccessible or otherwise not work properly. | No earlier than January 2025 | No earlier than January 2025
Stack Exchange Data Explorer (SEDE) | SEDE serves the public resources at data.stackexchange.com. | If this service experiences downtime, users visiting the Data Explorer may experience odd behavior such as queries failing to complete, or may be unable to access the server at all. | Not yet scheduled | Not applicable (SEDE is an app that interfaces with the database; the database migration happens at the same time as site DB migrations below.)
SocketServer | SocketServer serves data through all WebSocket connections around the network. | If SocketServer experiences downtime, realtime updates on the network will not function until service is restored. The website will otherwise continue to function. | No earlier than January 2025 | Not applicable
StackMail | StackMail serves emails to users around the network. | In the event StackMail goes down, users may not receive emails as expected; these emails may not send once service is restored. | No earlier than January 2025 | No earlier than January 2025
Ad server | Our ad server determines which users see which ads, and on which posts they see those ads. | In the event our ad server goes down, users will still see ads, but those ads may seem less relevant than usual, or they may appear in unexpected places. | No earlier than January 2025 | No earlier than January 2025
First site migration | We will select a first site to migrate to GCP. This will likely be a smaller network site. | In the event of downtime, that site may experience temporary disruptions until restored. | No earlier than February 2025 | No earlier than February 2025
API | The API allows users to interact with Stack Exchange via software they write. | If the API experiences downtime, user-created applications may go down or not function. | Not yet scheduled | Not yet scheduled
All network sites migration | After we test the migration on a few smaller sites, we will move the entire network except SO to GCP. | In the event of downtime, all network sites may experience temporary disruptions until restored. | Not yet scheduled | Not yet scheduled
Stack Overflow migration | | | Not yet scheduled | Not yet scheduled

In general, we expect that most database migrations will occur at most a few days after the application is migrated. Note that, depending on the situation, downtime may occur both when the application is migrated and when the database is migrated. If a significant outage is required, we'll also post a planned outage notice here on Meta as usual.

10
  • 6
    " they may appear in unexpected places." How unexpected exactly? I am not expecting ads to suddenly appear, floating in mid air. Would they? :D Commented Nov 28, 2024 at 1:27
  • 1
    More seriously - with chat, will SO, Chat.SE and Chat.MSE be migrated at once, or staged? Commented Nov 28, 2024 at 1:28
  • 3
    @JourneymanGeek Don't, uh, look behind you...
    – Adam Lear StaffMod
    Commented Nov 28, 2024 at 1:32
  • 2
    @JourneymanGeek All chat instances will be migrated at once.
    – Wouter de Kort Staff
    Commented Nov 28, 2024 at 9:55
  • Could Area51 migration impact the sidebar stats of a beta site?
    – Tim
    Commented Dec 1, 2024 at 16:36
  • 7
    @JourneymanGeek You'd better believe it. The ads are going to haunt you, haunt you like a spectre. The ads will scrape and claw slowly along the wooden floorboards, creating strange and terrifying sounds in the night. Their soft glow will be your only warning. They might also appear on the website, too, but they're much less terrifying when they're in the computer.
    – Slate StaffMod
    Commented Dec 2, 2024 at 18:05
  • 3
    Ah, so a little like having a cat. Commented Dec 2, 2024 at 18:21
  • 7
    @JourneymanGeek Yes, exactly - a cat is simply a remarkably destructive advertisement for canned fish.
    – Slate StaffMod
    Commented Dec 2, 2024 at 18:40
  • 1
    @Tim Don't think so.
    – Slate StaffMod
    Commented Dec 4, 2024 at 18:59
  • I assure you - Google, Stack's cloud provider, DOES care about ONLY the money. And will make ABSOLUTELY sure that they get it. Maybe they are treating Stack like an advertisement in how they get their money but they WILL get it. Stack's peeps are bubbled. Unlimited funds available, we don't care about making it cheap for our paying customers we just care about how great it is for us. Until 10 years later an alternative to Stack comes out that is saving money by running their own datacenter and Stack goes the way of Geocities. That's the reality of "going to the cloud" for most businesses Commented Dec 15, 2024 at 14:34
117

At a purely emotional level, as a bit of an old timer - the move from a 'traditional' hosting set up (with tech refreshes having in depth blog posts talking about the hardware choices) makes me a little sad.

That said, at a practical level, it kind of makes sense: there's the ability to scale (hopefully upwards!), and a certain ability to avoid potentially needing a bucket brigade to keep the servers running, or being brought low by redundant UPSes not really being redundant.

I hope it's a boring and rather routine move for all involved.

13
  • 16
    Thank you! I do hope at some point to have the time to write up a blog post about what we did and what our GCP setup looks like as I did for the FBB move to Azure.
    – Wouter de Kort Staff
    Commented Nov 13, 2024 at 9:10
  • 9
    Not even the cloud is safe from power failures.
    – Bergi
    Commented Nov 13, 2024 at 17:55
  • 7
    They're a bit later to the trend though. world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0 I suspect StackExchange load is relatively stable these days, so it doesn't actually benefit much from renting virtual capacity (cf. "peak load premium" specbranch.com/posts/one-big-server ).
    – Nemo
    Commented Nov 17, 2024 at 20:53
  • 5
    @Nemo our load varies a lot throughout the day and week. Most of our traffic comes from the US and during working hours.
    – Wouter de Kort Staff
    Commented Nov 18, 2024 at 8:53
  • 16
    It's also sad that nobody runs servers anymore. Not mail servers, not ftp servers, not http servers, not finger, usenet, irc, TeamSpeak, Ventrilo servers. Nobody remembers the entire point of the Internet: to connect all our computers together in a giant world-wide LAN. Instead all servers are being hosted by 1 of 3 businesses. At least Stackoverflow was an example of how Microsoft products can power one of the most heavily used web-sites on the planet. Nobody runs servers anymore.
    – Ian Boyd
    Commented Nov 19, 2024 at 20:45
  • @IanBoyd Are you calling the hundreds of data center engineers at those "big three" (ignoring the fact that those data centers are shared with other companies) as "nobodies"? The "average user" doesn't want to procure and maintain hardware... Everyone is still "installing/running server software" Commented Nov 27, 2024 at 4:47
  • @OneCricketeer Depends? Do they host their own services?
    – Ian Boyd
    Commented Nov 27, 2024 at 4:54
  • @IanBoyd I feel like we have different definitions of servers. That's all. But, me personally, yeah, I run an HTTP, VPN, DNS, SFTP, and DHCP server at home, all on separate hardware. Just because my ISP provides alternative solutions doesn't mean "I don't run them". Anyone that sets up Kubernetes is installing server software. Any containers running a REST API on it are HTTP servers. All nodes in the cluster are typically SSH servers Commented Nov 27, 2024 at 4:56
  • 1
    @WouterdeKort Thanks. Varies by how much? Wikimedia Foundation traffic in the USA for example varies between one third and two times the mean over the week. grafana.wikimedia.org/d/000000093/… Typically the peak load premium you pay on the public cloud is more like an order of magnitude (if not two), so it can be cheaper if you need to occasionally scale to ten times your average capacity for a few hours but otherwise costs more.
    – Nemo
    Commented Nov 27, 2024 at 9:35
  • 2
    @Nemo as mentioned, costs are not our main concern.
    – Wouter de Kort Staff
    Commented Nov 28, 2024 at 9:54
  • @Ian Boyd Not to mention that many sites need multiple providers, often sitting in front of their properties and concealing where your data is actually stored.
    – William
    Commented Dec 1, 2024 at 22:14
  • As I've gotten older I noticed that I became the tech racist I rebelled against once. Still, cloud bad. It's a matter of principle. Commented Dec 7, 2024 at 13:33
  • They are moving to the cloud because buying new hardware would be expensive and take up "weeks" of time for 4-6 SRE's lol... Who's gonna tell em? Commented Dec 10, 2024 at 23:07
34

If memory serves, SE used to/does have a backup datacenter location. I see talk of one datacenter closing down on the main post. What's happening at the other, and what are the plans for DR?

I do realise in theory another instance of a service can be spun up but - what's the plan for redundancy, for situations like an availability zone (or its Google Cloud equivalent) going down? Would there be geographical/zone redundancy, or is a single zone seen as sufficient?

1
  • 20
    Yes, we also have a location in Colorado and that one will go away with the move to GCP. In GCP we will have a highly available deployment with multiple Kubernetes clusters and a highly available SQL Server cluster.
    – Wouter de Kort Staff
    Commented Nov 13, 2024 at 10:57
31

I noticed it was mentioned that SOFT and SO Enterprise run on Azure - what's the benefit/reasoning for the move to GCP over Azure?

5
  • 14
    99% sure the choice is financial. Azure isn't cheap and they might be able to get some serious "newcomer" discounts from GCP
    – Robotnik
    Commented Nov 13, 2024 at 10:52
  • 36
    I understand this is an interesting question but as you guessed it gets a bit too close to financial details that I can't talk about.
    – Wouter de Kort Staff
    Commented Nov 13, 2024 at 15:33
  • 27
    From the Google "strategic partnership" announcement: "Stack Overflow has selected Google Cloud as the platform of choice to host and grow its public facing developer knowledge platform." I guess some of Google's payment for licensing community content was in-kind rather than money. Commented Nov 14, 2024 at 3:31
  • 1
    And they were willing to pay for all of those Microsoft software licenses, too? (As I understand it, they are heavily reliant on SQL Server in particular.)
    – SamB
    Commented Nov 19, 2024 at 23:44
  • 2
    They did get them discounted early on - and they'll still be running SQL Server, just on a different platform. Commented Nov 19, 2024 at 23:51
16

Our biggest unknown was how our application would perform in a cloud environment. Our app was built within the constraints of the data center, where we knew exactly how much latency we had and what we could squeeze out of our servers. Could we make our application run in a cloud environment or do we need to make big architectural changes?

We are happy to say that we did make it work. We built a set of load tests using k6 and we slowly built up the load we can support by tweaking our infrastructure [...]. We now think we’ve learned everything we can from these synthetic load tests and we’re ready to move on to the next step.

[...] After that, we will likely move meta.stackexchange.com to GCP so we can test performance with more traffic.

So I'm hearing that things can run, and it can handle load, and performance will continue to be tested. Great!

But what about server-side responsiveness, which affects time-to-first-byte for clients? More specifically, I mean the time between the first request byte reaching GCP and the first response byte leaving GCP. Is there any goal to keep that measure of latency on par with current performance levels?

I vaguely recall server-side responsiveness (at least for Q&A pages) being a point of concern and pride for the people who first built this platform (I don't know how much that's changed up until now). I think that's one of this platform's current strengths and it'd be a shame (and pain as a daily user) to see that get slower.
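
One way to make "on par with current performance levels" checkable is to compare latency percentiles rather than averages. The Python sketch below assumes you already have server-side response times for both environments. The sample values, the 10% tolerance, and the function names are all hypothetical and only there to show the comparison.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def on_par(baseline_ms, candidate_ms, pct=95, tolerance=0.10):
    """True if the candidate's percentile is within `tolerance` of the baseline's."""
    limit = percentile(baseline_ms, pct) * (1 + tolerance)
    return percentile(candidate_ms, pct) <= limit


# Hypothetical server-side response times in milliseconds.
DATA_CENTER = [18, 20, 21, 19, 22, 25, 20, 23, 40, 21]
CLOUD = [20, 22, 23, 21, 24, 27, 22, 26, 43, 23]

print("p50:", percentile(DATA_CENTER, 50), "vs", percentile(CLOUD, 50))
print("p95 on par:", on_par(DATA_CENTER, CLOUD))
```

Looking at a high percentile as well as the median catches the case where typical requests stay fast but the slow tail gets noticeably worse.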

1
  • 20
    Sorry for not being clear, our performance tests are there not only to support the load we have but also to make sure we get close to what the current end user experience is. The dashboard we have for our load tests shows APM traces, Cloudflare stats and more. Now I can't promise things will be as great as they are today. It's true that the site is extremely fast today in the data center but we are doing our best. Once Ascension is finished (we're in GCP and the data center is empty) we get into a hardening phase where we can improve upon what we have.
    – Wouter de Kort Staff
    Commented Nov 14, 2024 at 13:18
12

After that, we will likely move meta.stackexchange.com to GCP so we can test performance with more traffic.

I would suggest to pick a different guinea pig and not meta.stackexchange.com, because having that functioning correctly during the transition is going to be mission critical (bug reports, announcements, etc)

7
  • 26
    Eh. If Meta breaks it should be pretty obvious to lurking staff, and I'd rather Meta breaks than a site that serves quality content.
    – Spevacus Mod
    Commented Nov 14, 2024 at 23:04
  • @Spevacus: There are all kinds of non-obvious ways for things to break, too
    – Ben Voigt
    Commented Nov 14, 2024 at 23:06
  • We could tweet the corporate twitter channel :D. More seriously - I'd say having alternate comms methods, maybe using MSO if the SO mods/community are chill with it or the blog might be options. Commented Nov 14, 2024 at 23:44
  • 9
    On the whole, I'm not too worried about issues failing to bubble up in a timely manner. Of course, now that I've said that... Anyway, chat should work fine if posting a bug report on Meta becomes impossible for non-obvious reasons.
    – Slate StaffMod
    Commented Nov 15, 2024 at 0:09
  • 8
    i.sstatic.net/z1NozFh5.jpg sounds like a plan :D Commented Nov 15, 2024 at 1:19
  • 9
    meta will not be the first site to go. Once we hit the 'meta milestone' we should be pretty confident it won't break basic functionality like posting a question.
    – Wouter de Kort Staff
    Commented Nov 15, 2024 at 8:48
  • 12
    I'd rather Meta breaks than a site that serves quality content - My mind read that as Facebook. Commented Nov 15, 2024 at 21:42
11

Will a full transition to the cloud affect the availability of the site in different countries or regions? And one more thing: I'm sure we would all be very interested to learn in detail how the transition is going from a technical point of view. How is continuous operation of the site maintained during the transition (do the old and new servers run side by side for a while?), and how do you back up and transfer the data?

3
  • 2
    There will be no impact on where the sites are available. Everything stays the same.
    – Wouter de Kort Staff
    Commented Nov 15, 2024 at 18:22
  • 10
    I would love to blog about all the technical details! Just have to get the project done first I'm afraid.
    – Wouter de Kort Staff
    Commented Nov 15, 2024 at 18:23
  • 4
    @WouterdeKort That would make a nice change from the fluff pieces they've been posting lately :-).
    – SamB
    Commented Nov 19, 2024 at 23:49
11

While I understand that changing two things at once is not the best engineering practice, I would still like to ask whether plans for IPv6 are around the corner. Given the ready availability of IPv6 on Google / GCP, I hope it will be easier than before to embrace this future-proof internet protocol.

1
  • 5
    There are no plans to do this as part of Ascension but we do agree that we need to fix this. Some of the problems in moving to IPv6 will go away with our move to GCP and leaving some old systems behind. There are a lot of things we want to do after Ascension so I can't make any promises but we will definitely take IPv6 into consideration.
    – Wouter de Kort Staff
    Commented Nov 21, 2024 at 9:20
5

I've heard some large customers of cloud services use exclusively VMs and S3 storage (and maybe managed K8s clusters), for fear of vendor lock-in. Is SE planning on something similar, or is it going to embrace advanced cloud services like serverless (cloud functions etc.)?

2
  • "(and maybe managed K8s clusters)": Per Wouter, they strongly prefer K8s over VMs.
    – Brian
    Commented Nov 22, 2024 at 14:24
  • In my experience with K8s, this is a skill issue rather than one architecture being "better". How is K8s vendor lock-in? Simply containerizing the services is step 1... Those could still run in VMs and use solutions like MinIO for S3-compatible storage outside AWS. Commented Dec 2, 2024 at 13:39
2

If it is too expensive to run the entire thing on the ground, I believe there is still an advantage to maintaining one data center as a redundancy: institutional knowledge can be maintained as the data center will handle some of the workload, leverage can be provided in any future cloud negotiations, and in case Big Tech are hit by the proverbial bus, SO/SE will live on.

9
  • 4
    "If it is too expensive to run the entire thing on the ground" - the announcement explicitly states that this is not the case and is not a factor in this decision being made.
    – F1Krazy
    Commented Nov 13, 2024 at 14:21
  • 10
    @F1Krazy "Moving would be expensive", "This would be very expensive", "This takes time and money" are sure signs that it is a factor, even if it is not the factor, in the decision. Flexibility could be the reason, but putting all eggs in one cloud also hurts flexibility. Commented Nov 13, 2024 at 14:25
  • 1
    The upcoming closure of the data center is the trigger, but overall they're going to pay more now, at least that's what they claim, and I don't have a way to validate or invalidate this. Commented Nov 13, 2024 at 14:57
  • 21
    The trigger is the closing of the data center. We want to get to the cloud because of the benefits it brings us. Keeping one data center around doesn't help with that. Small example: running Kubernetes in GCP is very easy. Maintaining it on-premises is much harder. Keeping 1 data center would force us to do that and not allow us to use any other cloud native services that Google offers.
    – Wouter de Kort Staff
    Commented Nov 13, 2024 at 15:07
  • 1
    @WouterdeKort In the beginning, you said "We built a monolithic application that scaled incredibly well and we squeezed the most out of the hardware we had." Is there a reason why this approach can't continue? Perhaps adding one data center if necessary to keep the status quo. I don't see SO/SE introducing big features as well, which is one of the main rationales of moving to the cloud, except perhaps AI in the future, but AI can be separated from the rest. Commented Nov 13, 2024 at 15:57
  • 6
    @KennethKho I'm not a product roadmap expert, but we are definitely planning on adding features and investing in the public platform. The move to the cloud will help with that on a platform level.
    – Wouter de Kort Staff
    Commented Nov 13, 2024 at 16:10
  • @WouterdeKort I certainly see that new features are being added, and you may not know specifics, but my impression is that the features won't be so big that staying on the ground becomes untenable. Do you roughly see the new features as big enough to justify the move? Commented Nov 13, 2024 at 16:18
  • 4
    "and in case Big Tech are hit by the proverbial bus, SO/SE will live on" - the data will live on as long as the data dump does. As long as the data dump lives, depending on how fast the bus is, up to three months of data could be lost, but as little as 0 if there's time to export the entire database before the cloud dies. There's also plenty of redundancy as long as the torrents remain seeded Commented Nov 14, 2024 at 0:38
  • 4
    In theory, if Google does decide to kill off GCP, building stuff as containers should make moving easier. K8s is a bit of a common standard; it's work, but should the worst happen, moving would be a pain, not an epic struggle. Commented Nov 14, 2024 at 8:32
2

When you say there might be downtime:

While we'll do our best to minimize downtime, big changes like these frequently bring short periods of instability. We apologize in advance and appreciate your understanding.

do you mean read-only time/unable to log-in time like normal maintenance, or do you mean the site is completely down and can't be accessed at all?

Also, do you have any timeline for when you expect each move to take place and when you expect the site might go down?

7
  • 1
    I think it's more a matter of stuff working in testing, then going horribly wrong in production. I had a recent (personal) project that involved deploying something new. It worked in testing, it failed when I moved it to my server, and for extra fun, I have no clue how it worked in the first place. Commented Nov 13, 2024 at 13:12
  • 13
    We do hope that nothing goes horribly wrong in production, and we have plans to roll back changes. But there will be small amounts of downtime for individual apps when we have to switch over their database to GCP. This will be communicated in advance, but we don't have a fixed schedule.
    – Wouter de Kort Staff
    Commented Nov 13, 2024 at 14:03
  • @WouterdeKort Okay, thank you. What does "downtime" mean, though?
    – Starship
    Commented Nov 13, 2024 at 15:41
  • 7
    @Starship downtime in this case would be the app being unavailable. As you can imagine, that means different things for different apps. For example: StackMail being down means no emails can be sent. The Scheduler being down means that badge grants won't run. We will share those details in upcoming posts whenever we migrate an app and also share the potential impact.
    – Wouter de Kort Staff
    Commented Nov 13, 2024 at 15:44
  • @WouterdeKort Okay. For stackoverflow.com, will you still be able to read it? Thank you.
    – Starship
    Commented Nov 13, 2024 at 15:45
  • 3
    @WouterdeKort I think the question is "would there still be the backup server while migrating, allowing the site to still function in read-only mode"? Commented Nov 13, 2024 at 15:48
  • 7
    We'll get there once we start migrating SE sites. Currently we're only going to do supporting apps. We haven't figured it out yet but yes, we are going for minimal downtime and yes, it will probably be in read-only during a database move.
    – Wouter de Kort Staff
    Commented Nov 13, 2024 at 15:51
2

Could the company clarify whether there will be an archived version available? Is the Internet Archive (assuming it still exists) going to host this?

3
  • 7
    An archived version of what? We have no plans to stop providing data dumps, etc.
    – Adam Lear StaffMod
    Commented Nov 20, 2024 at 21:36
  • That's all I wanted to know, thanks @AdamLear
    – W.O.
    Commented Nov 20, 2024 at 22:21
  • There are unofficial community reuploads of the new versions of the SE data dump, after the data dump changes. The company does not care, so the community does the caring for them Commented Dec 3, 2024 at 16:35
2

As a skeptic, this seems to be purely VC driven, rather than out of any practicality.

I'm old enough to remember what Journeyman remembers, and how SE has been safe when the cloud dissipated.

I think you're losing out on a lot by completely moving away from having your own hardware, though I think it's entirely reasonable to build the infrastructure that allows you to "drag and drop" your servers.

But I also wish your SREs a very bland time of it.

5
  • 7
    "Wishing (someone) a very bland time of it" was not in my lexicon before, but it is now.
    – Spevacus Mod
    Commented Dec 5, 2024 at 19:56
  • 1
    Why would VCs want them to move to the cloud? I don’t follow that.
    – Jeremy
    Commented Dec 5, 2024 at 20:34
  • 2
    The same reason that VCs throw money at AI and .com bubbles. Or get companies to buy tens of thousands of dollars of swag for employees. Because their VC buddies are doing it. Commented Dec 6, 2024 at 14:02
  • 5
    I can say with 100% certainty this has nothing to do with any VC wanting something to happen. This is our decision. Feel free to disagree with that but it has nothing to do with a VC (or anyone else for that matter) forcing us to do this.
    – Wouter de Kort Staff
    Commented Dec 6, 2024 at 14:57
  • Fair enough! (Though, I'll still be suspicious of any efforts that exclude owning some hardware. Even if it's not a top-down decision, I'll still attribute it to VCs) Commented Dec 9, 2024 at 15:31
1

Best of luck! Is this a greenfield project, though? It seems like doing a proof-of-concept on-premises using something like K3s or OpenShift would give confidence in the possibilities and help decide whether fully migrating the applications is worth it (I believe it will be, but you might run into a few snags, particularly around networking and long-term storage).

Another avenue would be using something like Vagrant / Packer to build up VMs (or containers), then using Cloud Run / App Engine rather than going all-in on GKE.

HashiCorp Nomad is also arguably easier than a container-first orchestrator (see the blog posts about how many engineers it takes to maintain hundreds of nodes in a Nomad environment; hint: I've seen mentions of fewer than 5, and I've personally been on such a team).

2
  • 4
    We already run Stack Overflow Teams on Kubernetes. Teams and the public sites are the same code base. We're seeing a lot of advantages from containers over VMs and we do everything we can to not end up with VMs (except for SQL Server). So no, it's not completely greenfield. GCP is new but it's not that different from other clouds and we manage it basically the same way as we do Azure. We wanted to avoid having to run Kubernetes on-premises with all the complexity that brings.
    – Wouter de Kort Staff
    Commented Nov 18, 2024 at 8:56
  • 1
    What's perceived as the biggest headache managing on prem k8s? Commented Nov 18, 2024 at 15:24
-5

When going to GCP, consider moving from outdated technology a.k.a. SQL Server to some modern, scalable database such as Cloud Spanner. All of Gmail is on Spanner, so it would also easily handle the load of all Stack Exchange sites.

4
  • 14
    Changing two things at the same time is a good way to ensure that things break, and that they stay broken for a long time.
    – Mark
    Commented Nov 20, 2024 at 2:07
  • 14
    "outdated technology a.k.a. SQL Server" Citation needed.
    – TylerH
    Commented Nov 26, 2024 at 15:53
  • 4
    What is the business case for switching to a different platform with different requirements, different supported languages, and different costs? The fact that Spanner can handle very large data sets is only an advantage if you actually NEED those very large data sets.
    – barbecue
    Commented Nov 27, 2024 at 15:07
  • 4
    @TylerH - SQL(server) is old, known, reliable and boring, therefore outdated. Q.E.D. /s
    – Orangutech
    Commented Dec 2, 2024 at 3:01
