
Server down issues and data loss, changes for the future and improvements


brunyman


  • Founder

As some of you already know, one of our dedicated servers had issues this morning. The Minecraft servers affected are FTB StoneBlock2, FTB SkyAdventures, Network AssassinS and Creative. We got this server less than a year ago, when Intel gen 9 processors finally became available. We actually got 5 such dedicated servers at the beginning of this year, and unfortunately we have had issues with 3 of them since then. We had to change SATA cables and hard drives, as 2 servers were having HDD failures and degraded RAID arrays from time to time, replace a burned power supply, and even replace an entire server except for its hard drives. So a lot of lost hours and headaches.

We usually plan server upgrades around the end of one year or the beginning of the next, so we can plan our expenses for the new year. It is obvious that these Intel gen 9 servers have a lot of issues. I think they were mainly caused by the motherboard chipset/controllers, at least for the RAID issues. So the plan is to upgrade to new servers, this time running Ryzen; gen 3 is what is available in datacenters for now. We have already tested one for the last 4 months, and even if single-thread performance is a bit lower, multithreaded performance is way better than Intel and we have had 0 issues so far, whereas on Intel random crashes started just 1-2 months after we got the servers.

Another big performance improvement that is part of the upgrade plan is moving from HDDs to fast NVMe SSDs. We will have a bit less room for backups, but the plan is to build a NAS server, download all backups directly from the servers and keep them offsite. If an entire server is lost and its data cannot be recovered, we will have backups on our NAS to restore from, with at most a few hours of data lost. We will also use these backups for rollbacks.

Almost 2 months ago I requested the cancellation of 2 dedicated servers. Both had repeated issues with their RAID arrays: the second drive fails and the array enters a degraded state, we reboot into rescue mode, fix partition errors and rebuild the array, and then a few days or a month later it fails again, and so on. The drives' SMART data shows no issues. So we first requested to cancel the 2 contracts where the issues are active, and based on the contract terms the cancellation takes effect after 4 weeks.

This week I ordered our first new dedicated server on the new specs, with CentOS 8 and NVMe SSDs, and got it set up. This morning, after we finally got the GT server released, I planned to move everything from the SA/SB host to the new machine, as the contract ends tonight. But when I woke up, the SA and SB servers were down and crashing on startup with weird read-permission errors. Then I checked and AssassinS was in the same crash state, and even system commands were not working, so I did a system reboot, thinking that maybe the OS had bugged out.

After the reboot everything was working fine, so I started to transfer data to the new server. While the files were copying, I noticed that some configs I had set up before I went to bed were back to their original state, and then I noticed that the file dates were from 1 month ago. I checked the backups and all backups from this month were missing. The RAID array was already in a degraded state with the second drive out of sync, and this failure had happened 1 month ago, so it was clear that somehow the system was now loading the drive that was out of sync and no longer in the RAID array. This is so weird. I spent 8 hours trying to get access to the good data partition to copy my up-to-date backups, but the system just failed to recognize the partition on the first drive, as if something was badly corrupted at the base of the file system.

root@grml /home # cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md126 : active (auto-read-only) raid1 sda2[0]
524224 blocks super 1.0 [2/1] [U_]
bitmap: 1/1 pages [4KB], 65536KB chunk

md127 : active (auto-read-only) raid1 sdb4[1]
1944467456 blocks super 1.2 [2/1] [_U]
bitmap: 15/15 pages [60KB], 65536KB chunk

I also contacted the datacenter technicians, and when they saw the above data they said they had never seen anything like this and would bring in a senior technician to check whether they had experienced something similar before and try to fix it. But after spending 8 hours trying to get the data off that drive, I am sure that we can't get anything restored. So all that was left was to restore from my last-resort monthly backup, which was downloaded on 8 November. So we lost 19 days of data :(
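For anyone running a similar setup: the `[2/1]` and `[U_]` / `[_U]` markers in the `/proc/mdstat` output above are what show that each array is running on a single drive. Below is a minimal Python sketch of how such a state could be detected automatically. The function name `find_degraded_arrays` is made up for illustration and this is not a script we currently run; it only reads text in the format shown above.

```python
import re

def find_degraded_arrays(mdstat_text):
    """Return {array_name: member_status} for arrays with a missing member.

    Hypothetical helper: parses text in the /proc/mdstat format.
    """
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        # Array header lines look like "md126 : active raid1 sda2[0]".
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status lines contain e.g. "[2/1] [U_]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            degraded[current] = status.group(3)
    return degraded

# Sample text copied from the rescue-system output in this post.
SAMPLE = """\
md126 : active (auto-read-only) raid1 sda2[0]
524224 blocks super 1.0 [2/1] [U_]
bitmap: 1/1 pages [4KB], 65536KB chunk

md127 : active (auto-read-only) raid1 sdb4[1]
1944467456 blocks super 1.2 [2/1] [_U]
bitmap: 15/15 pages [60KB], 65536KB chunk
"""

if __name__ == "__main__":
    for name, members in find_degraded_arrays(SAMPLE).items():
        print(f"{name} is degraded: [{members}]")
```

On a live machine the same function could be fed the contents of `/proc/mdstat` from a scheduled job that sends an alert when the result is not empty. How the alert is delivered is left out here on purpose.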

To prevent this in the future we will build our own NAS server, so that all backups from all machines are saved directly there. This way, if something like this ever happens again, we will lose at most a few hours of data. We will also continue to replace all existing servers over the next months and upgrade to fast NVMe SSDs.
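Part of what went wrong here is that a month of missing backups went unnoticed, so a freshness check on the NAS side is worth having. Below is a minimal Python sketch of one, under stated assumptions: the function names, the one-folder-per-server layout and the 6-hour limit are all hypothetical and only illustrate the "at most a few hours" goal; this is not the final NAS setup.

```python
import tempfile
import time
from pathlib import Path

def newest_backup_age_hours(backup_dir, now=None):
    """Age in hours of the most recently modified file, or None if there is none."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).iterdir() if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0

def backups_are_stale(backup_dir, max_age_hours=6.0, now=None):
    """True if the folder has no backups or the newest one is too old.

    max_age_hours=6.0 is an assumed threshold, not a value from this post.
    """
    age = newest_backup_age_hours(backup_dir, now)
    return age is None or age > max_age_hours

if __name__ == "__main__":
    # Demo on a temporary folder standing in for one server's backup folder.
    with tempfile.TemporaryDirectory() as folder:
        print("empty folder stale:", backups_are_stale(folder))
        (Path(folder) / "world-backup.tar.gz").write_bytes(b"demo")
        print("fresh backup stale:", backups_are_stale(folder))
```

Running a check like this on the NAS rather than on the game server means it still works when the game server itself is in a bad state.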

All the servers except Network Creative are now back online and running on the new server. The Creative server was recently updated to 1.16.4 and this incident wiped out all that work, so we will have to update it to 1.16.4 again. The AssassinS server was pending a WIPE, as it also needs to be updated to 1.16.4, and we will do that next week.

I'm very sorry for this loss. For the SkyAdventure and StoneBlock2 servers we will do our best to refund affected players with items, in-game money and rank re-activations to cover the loss.
We created a free compensation package for the SkyAdventure and StoneBlock2 servers in the DailyRewards category on our store. You can only claim this reward once, and it will give you in-game money and crate keys. Check the links below.

Data loss compensation for SkyAdventure:  
https://craftersland.buycraft.net/category/1279398

Data loss compensation for StoneBlock2: 
https://craftersland.buycraft.net/category/1309676

If you lost a rank or any other purchase, request its reactivation here: 
https://forum.craftersland.net/forum/223-rank-transfer/

StoneBlock2 item refunds:
https://forum.craftersland.net/forum/299-technical-support/

SkyAdventure item refunds:
https://forum.craftersland.net/forum/287-technical-support/


Thanks for your support!

 
