• Nighed@sffa.community

    Surprised a company of their scale, with such a reliance on stability, isn’t running their own data centres. I guess they were trusting their failover process enough not to care.

    • brianorca@lemmy.world

      They probably need to be in so many different locations, and so many different network nodes, that they don’t want to consolidate like that. Their whole point of being is to be everywhere, on every backbone node, to have minimum latency to as many users as possible.

  • draughtcyclist@programming.dev

    This is interesting. What I’m hearing is they didn’t have proper anti-affinity rules in place, or backups for mission-critical equipment.

    The data center did some dumb stuff, but that shouldn’t matter if you set up your application failover properly. The architecture, and the failure to test failovers, are the real issues here.
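
    Something like the toy check below is all an anti-affinity rule boils down to. This is a rough sketch with made-up service and facility names, not Cloudflare’s actual tooling: flag any service whose replicas all sit in a single failure domain.

        # Toy anti-affinity check (hypothetical names, illustrative only).
        # Flags services whose replicas all share one failure domain (here, a facility).

        placement = {
            # service -> list of (replica_id, facility)
            "control-plane-api": [("api-1", "PDX-04"), ("api-2", "PDX-04")],  # violation
            "config-store": [("cfg-1", "PDX-04"), ("cfg-2", "PDX-02"), ("cfg-3", "PDX-01")],
        }

        def anti_affinity_violations(placement):
            """Return services not spread across at least two failure domains."""
            bad = []
            for service, replicas in placement.items():
                domains = {facility for _, facility in replicas}
                if len(domains) < 2:
                    bad.append((service, domains))
            return bad

        for service, domains in anti_affinity_violations(placement):
            print(f"{service}: every replica is in {domains}; losing that facility takes it down")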

  • DoomBot5@lemmy.world

    This reminds me of how AWS lost critical infra when us-east-1 went down, including the status dashboard, which was only hosted there.

  • kent_eh@lemmy.ca

    I’ll be curious to learn if the battery issue was due to being under-dimensioned, or just aged and at reduced capacity.
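
    Either way, the arithmetic is unforgiving. A back-of-the-envelope sketch with entirely made-up numbers (not Flexential’s actual plant) shows how quickly ageing eats into runtime:

        # Back-of-the-envelope UPS runtime estimate. All figures are invented,
        # purely to show the effect of sizing vs. battery ageing.

        def runtime_minutes(capacity_kwh, load_kw, state_of_health=1.0, usable_fraction=0.8):
            """Minutes of runtime from nominal capacity, IT load, remaining
            state of health (1.0 = new), and the safely usable fraction."""
            usable_kwh = capacity_kwh * state_of_health * usable_fraction
            return usable_kwh / load_kw * 60

        print(runtime_minutes(500, 2000))                        # new bank: ~12 min
        print(runtime_minutes(500, 2000, state_of_health=0.6))   # aged to 60%: ~7 min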

  • JakenVeina@lemm.ee

    “the overnight shift consisted of security and an unaccompanied technician who had only been on the job for a week.”

    That poor bastard.

  • Scott@sh.itjust.works

    It isn’t Flexential’s year.

    They got burned by DediPath.

    They got burned by NextArray.

    They just got ousted by Cloudflare.

    • Nine@lemmy.world

      If this keeps up, someone is going to be making 3 envelopes…

  • Dr. Dabbles@lemmy.world

    It was poor design. Poor design caused a 2-day outage. When you’ve got an H/A control plane designed, deployed in production, and running services, and you ARE NOT actively using it for new services, let alone porting old services to it, you’ve got piss-poor management with no understanding of risk.

  • AutoTL;DR@lemmings.world (bot)

    This is the best summary I could come up with:


    Cloudflare’s main network and security duties continued as normal throughout the outage, even if customers couldn’t make changes to their services at times, Prince said.

    We’re told by Prince that “counter to best practices, Flexential did not inform Cloudflare that they had failed over to generator power,” and so Cloudflare didn’t have a heads-up that things were potentially about to go south and that contingencies should be in place.

    Whatever the reason, a little less than three hours later at 1140 UTC (0340 local time), a PGE step-down transformer at the datacenter – thought to be connected to the second 12.47kV utility line – experienced a ground fault.

    By that, he means at 1144 UTC, four minutes after the transformer ground fault, Cloudflare’s network routers in PDX-04, which connected the cloud giant’s servers to the rest of the world, lost power and dropped offline, like everything else in the building.

    At this point, you’d hope the servers in the other two datacenters in the Oregon trio would automatically pick up the slack, and keep critical services running in the absence of PDX-04, and that was what Cloudflare said it had designed its infrastructure to do.

    The control plane services were able to come back online, allowing customers to intermittently make changes, and were fully restored from the failover site about four hours later, according to the cloud outfit.


    The original article contains 1,302 words, the summary contains 228 words. Saved 82%. I’m a bot and I’m open source!