[RFO] [2023-04-17] Public-facing services outage - rocky-announce

21 Apr 2023


On April 17, 2023, we performed planned maintenance that was intended to be transparent to users. As part of our current
work standing up new infrastructure, we needed to migrate our FreeIPA domain (which runs DNS and authentication for our
internal and external services) from Rocky Linux 8 to 9.
The plan was essentially to:
* Remove one node
* Add a new node and configure it as needed via Ansible
* Repeat the above until all nodes are migrated
As FreeIPA handles the internal DNS and responds to internal requests from our external services where necessary, this
should have been fairly routine. Most internal services do not notice IPA servers coming and going: the SRV records
change and things carry on as normal.
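To illustrate why SRV-aware clients ride through a node swap, here is a minimal sketch of how a client might order the
targets it gets back from an SRV lookup. The hostnames, port, and record values are hypothetical and are not taken from
our environment, and the deterministic sort is a simplification of the weighted random selection described in RFC 2782.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int
    weight: int
    port: int
    target: str


def order_targets(records):
    """Order SRV targets: lowest priority first, then highest weight.

    Simplification: RFC 2782 picks among equal-priority records at random,
    weighted by `weight`; a deterministic sort keeps this sketch easy to follow.
    """
    ranked = sorted(records, key=lambda r: (r.priority, -r.weight, r.target))
    return [f"{r.target}:{r.port}" for r in ranked]


# Hypothetical answers for an _ldap._tcp lookup before and after one node is replaced.
before = [
    SrvRecord(0, 100, 389, "ipa-old-1.example.internal"),
    SrvRecord(0, 100, 389, "ipa-old-2.example.internal"),
]
after = [
    SrvRecord(0, 100, 389, "ipa-new-1.example.internal"),
    SrvRecord(0, 100, 389, "ipa-old-2.example.internal"),
]

print(order_targets(before))
print(order_targets(after))
```

A client that repeats the lookup simply sees the new target in the list. Anything configured with fixed server
addresses does not get that update.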
This is mostly transparent when following the proper documentation for FreeIPA migrations between major releases of
Enterprise Linux. However, during this work, our internal firewall and unbound DNS cache were not made aware of the new
IPA systems as we added and removed nodes, and so problems began and eventually cascaded externally. The one remaining
Rocky Linux 8 node was the only server responding to requests from our haproxy and unbound.
Although that node was still answering, this caused the following issues:
* The CDN would fall back to the appropriate parameters to ensure mirror manager would still respond with at least one
     mirror (dl.rockylinux.org)
* The CDN would briefly detect our services as back up
* Some users would succeed, while others would time out
* Anyone hitting mirrors.rockylinux.org would eventually time out
* The CDN would detect the services as down again and try to fall back
* The above would loop endlessly
This also unfortunately prevented us from logging in to our VPN by normal means. The appropriate infrastructure
contacts were notified to assist in getting in and fixing the internal resolver to bring all services back online. After
all services were back online, we were able to migrate the final Rocky Linux 8 node to 9 without further impact to our
infrastructure and our users.
Corrections have been made to the infrastructure configuration to provide better fault tolerance in the case of DNS
outages like this.
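As an illustration of the class of problem only, and not a description of the corrections we actually made, a check
along these lines compares the server addresses a resolver or firewall is configured with against the servers that are
currently live. The function name and the addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical.

```python
def find_stale_entries(configured, live):
    """Return (stale, missing): configured-but-gone and live-but-unlisted addresses."""
    configured_set, live_set = set(configured), set(live)
    stale = sorted(configured_set - live_set)
    missing = sorted(live_set - configured_set)
    return stale, missing


# Hypothetical data: what the resolver forwards to vs. which IPA servers exist now.
resolver_forwarders = ["192.0.2.11", "192.0.2.12", "192.0.2.13"]
live_ipa_servers = ["192.0.2.13", "192.0.2.21", "192.0.2.22"]

stale, missing = find_stale_entries(resolver_forwarders, live_ipa_servers)
if stale or missing:
    print(f"stale entries: {stale}")
    print(f"missing entries: {missing}")
```

Run before and after each node swap, a comparison like this would flag a configuration that still points only at
servers being retired.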
If you have any questions, please reach out.
Best,
Neil Hanlon
Infrastructure Team Lead, Rocky Enterprise Software Foundation