Summary for Riot's AMA about EUW Service Issues

Posted at 5:09 AM by Moobeat
As promised, Riot's tech leadership hosted an Ask Me Anything about all the server problems that have been plaguing EUW for the greater part of a few months.
Continue reading for the scoop on the problems and a peek at the technological side of things.

Demorphic started things off by reiterating the purpose of this server problem AMA and introducing the Rioters who are participating in the AMA.
"Hi everyone!
I'm Matt "Demorphic" Elliot, and as promised we will be kicking off an AMA today at 6pm UK time with three members of Riot's senior technology leadership:
  • Scott "Scott Thru The Heart" Gelb, Vice President of Technology
  • Ron "sonicdeathriot" Williams, Vice President of Operations
  • David "RiotBanksy" Banks, Technical Director
Joining us will also be Tomasz "Riot tmx" Ankudowicz, our Live Services Producer for Europe.
We're here to talk with you about the recent service disruptions you've experienced on EUW, and to answer any questions you have about how the League of Legends service operates, and why it sometimes breaks badly. 
I'd humbly ask that you focus your questions on the service/technology side - the Rioters with me today are super busy working on improving the network infrastructure behind League of Legends to avoid issues like this in the future, and they don't often get a chance to hop in and take your questions. 
They will also be fielding questions from multiple European languages, so please bear with us. Realistically we will not be able to answer all questions live. However, we have committed to following up this session with a more comprehensive set of questions and answers before this coming weekend. 
We're going to be doing more deliberate community communication like this moving forward - we understand that not getting this kind of face time with the team at Riot HQ is also a big point of pain for EU players, and we hope this is a valuable step in the right direction. Please kick things off and we will be jumping in at 6pm UK time. You then have our undivided attention for 2 hours. So fire away!"

Now on to the questions! As usual I've tried to clump like-minded questions together and format things so the questions flow. It should be noted that the questions in the brackets [ ] are my interpretations of the questions, not the word for word questions asked.


The Problem, EUW, and Communication

[ What is the extremely tech-lingo filled reason for the problems? ]
"I wanted to share a big wall of text for anybody who'd like to go deeper into the tech today. :) This write-up describes some of the big problems that were impacting all the regions in Europe (and other platforms outside of Europe) until about mid-April (and more recently for Korea): 
EU Core Switch Summary 
Since late October there have been several major outages in Europe that were related to a scaling problem with our core switches in Frankfurt. Our Korea players also suffered from this issue recently along with some other painful problems. Last week we finally figured out the root cause of this painful issue. The root cause of the core network switch malfunction was that our game servers were redundantly connected with two network interface cards (NICs) that were set up in an active/active load balancing mechanism we use in all of our environments around the world. The network load balancing was accomplished through the use of a “smart” algorithm that dynamically moved traffic from one NIC to the other. Since each server was connected to the network twice, every time the server flipped its active NIC, the entire Europe network would have to change its network packet forwarding tables to account for the change. Specifically, the MAC address table inside the switches needed to be updated constantly. Typically updating a MAC address table isn’t an issue with just a few servers, but it becomes an issue as the server count increases when those servers are using “smart” load balancing. When the game servers change their active NIC this has to be propagated across the entire network. It starts from the top of rack switch that the server is directly connected to, sends the change to the core network switches and then is distributed out to all devices connected to the network. As network traffic increases, the servers will swap the active NIC more often to distribute the load, which causes the MAC address tables to be updated at a rate that cannot be handled by the core network switches. In the case of our Europe (and Korea) game servers this means the MAC address table was updating millions of times per second during peak playtimes. North America was spared this pain because we have fewer game servers on the core switches than we do in Europe (or Korea).
This is where the network flooding starts that would eventually crash the core switches that support the entire Europe service. Inside the core network switch, the MAC address table stores a MAC address mapped to a physical network port that it uses for forwarding and filtering network traffic. As the rate of updates exceeds the capability of the switch, the switch falls back into a fail-safe mode where it sends the packets to all ports on the switch. Traffic that is normally only sent to the destination that it is addressed to is now being broadcast across the entire environment. As a result all of our servers started getting lots of unneeded network traffic. This had the net effect of essentially creating a denial of service attack across the entire network. Welcome to meltdown city. And when the core switches crashed they did not keep any diagnostic information we could access to help troubleshoot the problem.

To mitigate the above issue, we have done a few things: 
1. Short Term Fix: Disable secondary NICs on the game servers. The idea behind this is that if there is 1 NIC and 1 related MAC address per game server (instead of 2), this completely removes the possibility of traffic jumping around from one NIC to the other. The network will no longer have to update itself as a result of any server side load balancing. 
2. Long Term Fix: Change the environment to use active/passive failover on the NICs. The server will have a backup connection but it will only use it in the case of the primary path failing. This will stabilize the environment greatly but still allow us to provide high availability in the case of a localized failure.
3. Upgrade the core network infrastructure with hardware that can handle a huge amount of additional load. Why not, I mean, more power is better, right?! 
a. Upgrade the core switches so if we get surprised again with extra traffic for whatever reason, we are ready for it. Different designs can keep up better with MAC address table updates which would allow us to better handle the load should something similar happen in the future. 
b. An improved design will also reduce the chance of one switch failure crashing everything in the data center. 
Current Actions: 
We have been working on a daily basis with our network switch vendor (the same one used by the leading Internet companies and other leading websites). During the outages we’ve been passing the switch vendor logs we gather as we make changes to the Europe (and Korea) network infrastructure, and have been working alongside them to diagnose the root cause of why the core switch isn’t able to handle the rate of change to the MAC address table. We have also been working with the vendor to expedite the new network gear to upgrade Europe's network. We placed orders for this gear several weeks ago, but due to its very big capacity it is built to order by the manufacturer. Some of this gear is already installed and has helped stabilize EUW and EU Nordic's network since mid-April 2013.

We are sorry for this inconvenience; it has been a very difficult problem to track down and fix. We have learned from this issue and are building a better network to protect the quality of the service in Europe and in our other service areas." - sonicdeathriot
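
( Editor's note: For anyone who wants to poke at the idea, here is a rough Python toy model of the core switch problem described above. It is only a sketch under my own assumptions, not Riot's setup: the `Switch` class, the `run` helper, the per-tick capacity number, and the "flip every tick" worst case are all made up for illustration. It counts how often a server's MAC address moves between switch ports in active/active versus active/passive mode, and flags a tick as "flooding" once the moves exceed what the switch can absorb. )

```python
from dataclasses import dataclass, field


@dataclass
class Switch:
    """Toy core switch that remembers which port each MAC was last seen on."""
    max_moves_per_tick: int                 # hypothetical capacity limit
    mac_table: dict = field(default_factory=dict)
    moves_this_tick: int = 0
    flooding: bool = False

    def start_tick(self) -> None:
        """Reset the per-tick counters."""
        self.moves_this_tick = 0
        self.flooding = False

    def learn(self, mac: str, port: int) -> None:
        """Record a frame from `mac` on `port`; count it if the MAC moved."""
        known_port = self.mac_table.get(mac)
        self.mac_table[mac] = port
        if known_port is not None and known_port != port:
            self.moves_this_tick += 1
            if self.moves_this_tick > self.max_moves_per_tick:
                # Fail-safe from the post: send traffic to every port.
                self.flooding = True


def run(mode: str, servers: int, ticks: int, capacity: int) -> tuple[int, int]:
    """Return (total MAC moves, number of ticks spent flooding)."""
    switch = Switch(max_moves_per_tick=capacity)
    total_moves = 0
    flooded_ticks = 0
    for tick in range(ticks):
        switch.start_tick()
        for s in range(servers):
            if mode == "active/active":
                nic = tick % 2      # assumed worst case: flip NIC every tick
            else:
                nic = 0             # active/passive: stay on the primary NIC
            port = 2 * s + nic      # each NIC is plugged into its own port
            switch.learn(f"server-{s}", port)
        total_moves += switch.moves_this_tick
        flooded_ticks += int(switch.flooding)
    return total_moves, flooded_ticks


if __name__ == "__main__":
    for mode, servers in [("active/active", 100),
                          ("active/active", 1000),
                          ("active/passive", 1000)]:
        moves, flooded = run(mode, servers, ticks=10, capacity=500)
        print(f"{mode:15} servers={servers:5} moves={moves:5} flooded_ticks={flooded}")
```

With these made-up numbers a small active/active fleet stays under the limit, a large one floods on every tick after the first, and active/passive produces no MAC moves at all unless a primary link fails, which is the point of the long term fix.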

[ Can the PBE be used to avoid these problems in the future? ]
"We do use PBE for testing our deploys, patches, hotfixes, etc. It's in constant use for client-based issues and new releases are always first deployed to the test server and many bugs are caught this way. It does not cover all the re-configurations (this word is unfortunate, but I'm talking about all platform changes, firewall policies, drives, front and backend applications, database modifications, Coherence tweaks, etc) we perform on our network and system infrastructure. On the other hand, PBE lives in our datacenter and is hosted from some of our standard game servers, so in fact it is a part of our bug investigation." - Riot tmx

[ Can you explain why EUW has so many problems compared to other servers? ]
"Service quality across the many regions League of Legends runs in is actually about the same when looked at across a full year of data; however, sometimes a certain region will have a bad run of issues that can last several days. Also, when issues happen, our biggest regions do tend to see extra player impact. This is due mainly to: 
1. When a big region has an issue that kicks a lot of players out of the game or off the platform it puts a lot of stress on the systems that let players back into the system. This creates complications that lead to the need for us to limit the number of players per second we can let back into the service which means in large outages it can take several hours to let everyone back into the service. Also when this kind of load is placed on the system it can sometimes create secondary issues that will extend the unplanned downtime. There is a lot of work going on in our Operations and Development teams to greatly increase how many players per second we can let into the service and also improve the stability of the service when it is under stress when huge numbers of players are trying to log back in after an outage at peak times. 
2. Our biggest regions like EUW are also our most complicated infrastructures to maintain. This means that new releases can take longer to deploy, troubleshooting problems can take longer, and there are just more things that can break, be misconfigured, or have other issues. EUW and EU Nordic are both currently being rebuilt in Amsterdam on brand new servers, network, and security gear (note that game servers will continue to be available in Frankfurt with even more being added in Amsterdam). This will give players the latest architecture we have been testing for the last several months to ensure EUW and EU Nordic can continue to grow larger and have even better service stability. It is a huge undertaking but coming soon." - sonicdeathriot

[ How big are the EUW servers? ]
"Our EUW data center in Frankfurt is pretty astounding. We rack thousands of game servers globally and house many of them there. In EUW, we are handling many millions of game data packets per second. We are constantly making improvements to network infrastructure, hardware and optimizing our software to handle the ever growing enthusiasm for League." - Riot Banksy

[ What databases and what sort of teams manage such huge systems? ]
"We're on various DBs, and beside Microsoft SQL and some other premium brands we also operate on fancy versions of well-known MySQL. 
All services are hosted from different servers. Actually, clusters of servers divided by functionalities. Let's take chat: once creating the connection to the platform you're first reaching the load balancer, which creates a connection to given chat server. It handles the traffic in the most efficient way, so if one of the chat servers suddenly drops, it does not affect all player base on the platform. 
We're working 24/7. We have our Network Operations Centers all over the world and some of them work in 3 overlapping shifts. Problem might be handled by the team in Santa Monica, or Dublin, or Seoul, or Istanbul or Rio. Additionally, we have our on-call engineers on various continents focusing on every single incident happening anywhere. It doesn't really matter if we're hit in Singapore, Brazil or Europe. The team is always there and people are back to their workstations instantly." - Riot tmx
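
( Editor's note: Here is a small Python sketch of the chat load balancer idea Riot tmx describes above. It is my own illustration, not Riot's code: the `ChatLoadBalancer` class, the "least connections" rule, and the server names are assumptions. The thing to notice is that when one chat server drops, only the players on that server get moved. )

```python
class ChatLoadBalancer:
    """Toy balancer that sends each player to the least busy chat server."""

    def __init__(self, servers):
        # Map of server name -> set of connected players.
        self.sessions = {name: set() for name in servers}

    def connect(self, player):
        """Attach the player to the server with the fewest sessions."""
        server = min(self.sessions, key=lambda name: len(self.sessions[name]))
        self.sessions[server].add(player)
        return server

    def server_down(self, name):
        """Drop a server and reconnect only the players it was hosting."""
        orphans = self.sessions.pop(name)
        for player in sorted(orphans):
            self.connect(player)
        return len(orphans)


if __name__ == "__main__":
    lb = ChatLoadBalancer(["chat-1", "chat-2", "chat-3"])  # hypothetical names
    for i in range(9):
        lb.connect(f"player-{i}")
    moved = lb.server_down("chat-2")
    print(f"players moved: {moved}")
    print({name: len(players) for name, players in lb.sessions.items()})
```

In this toy run only the players on the failed server are reconnected; everyone on the other two servers keeps their session.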
[ Why not just get more servers? ]
"We are constantly making investments in new testing tools, new monitoring, and higher capacity equipment to handle the growing Europe player community. Currently we actually have enough server infrastructure and are constantly installing more to stay ahead of the growth. Our recent issues are not related to server capacity. When we do run into capacity issues they usually involve either network gear tuning, network security adjustments, or tuning the software/databases that support the service. It can be very difficult to simulate the load our large Europe player base can place on the network and servers for each region, so sometimes we run into unexpected issues during peak service times that require further troubleshooting and tuning." - sonicdeathriot
[ If it is too big to handle, are there any plans to split EUW? ]
"We currently do not have plans to split the EU-West platform. We are constantly upgrading the infrastructure and adding additional capacity to support the growth of EU-West, one of the largest League of Legends regions in the world. We recently addressed one of the technical challenges we've faced with the scale of our core network. Ron Williams, our VP of Operations, is going into more detail on the specific issue and how we’ve solved it in a separate post (http://euw.leagueoflegends.com/board...23700#11923700). We have engineers working around the clock to fix the stability issues and dedicated teams that are focused on scaling our systems proactively to support the live service." - ScottThruDaHeart
[ Are you sure you don't want to split EUW? ]
"We don't want to split EUW, I think this has been messaged out already. What we do instead is to improve our infrastructure even more. We already have top quality devices in place, but life brings us new solutions every month and we're constantly upgrading our systems. We recently performed large maintenance in the Frankfurt data center where we completely re-designed our network infrastructure or replaced all the database drives with the high-end solutions. It all gives us some additional power and if you take a look a year or two behind, we're now able to provide service for over 100% more players comparing to the past. Also, not sure if you noticed, getting back on EUW platform is now quicker than ever, as we're able to throttle players in much faster in the peak times or during the incidents." - Riot tmx

[ How will you tackle similar problems in the future? ]
"There are always going to be new challenges as we keep growing League of Legends. However we continue to make investments in new data centers, new equipment, and hire smart engineers to provide the best player experience we can." - sonicdeathriot

[ What are your plans to improve communicating server problems with players? ]
"Beside keeping the platform in good health, communication is our main goal. It always was, but we know we didn't really do great job here. So, about plans: 
1) We're working on standardizing Live Incident communication in all languages. These will be high quality, automated and localized announcements posted regularly during the incident and we know it's going to work better, than our current system. We're still finalizing the scripts. 
2) We introduced new program called Post-Incident Messaging. Every time our platforms come through the issues, we're going to tell you about it within next 24-48 hours. We will summarize our efforts, tell you what we've done, what kind of problems we noticed, what was the impact and how did we mitigate the incident. We will be actively answering all your questions and we really want to put our best effort into this. 
3) We're also working on improving the communication within the client, as we know that some players don't even check forums for service status. More info coming soon." - Riot tmx
[Will you be able to keep this promise of better communication? It hasn't worked in the past. ]
"Since we're in Europe and we work on LoL in 10 different languages, it's quite demanding to support forums in all languages the way we all wish. We have our Community department, but we feel it's not enough. Obviously we're going to improve and we'll eventually reach our point, but it's quite different comparing to our US HQ, where most of Rioters speak and comment in English. Our plan for this is to encourage the entire Dublin office to be more active on forums: to be part of discussions, to talk to you about champions, new maps and other features. Obviously we're all swamped with our daily work, but being active on forums is another part of our work. And we want to be closer to our community, for sure. Also, we're actively hiring more folks to our Community department in all languages, so this should be noticeable pretty soon." - Riot tmx

Compensation

[ Why can't we have RP instead of IP Boosts for compensation? ]
"We've never given RP as a widespread compensation for server instability. There was one case where less than 1% of NA players were unable to access their purchased content for more than a week and we compensated them with RP. To reiterate: we have never given out RP for any server instability in any region." - Riot Pwyff

[ But Riot Pwyff, THERE HAVE BEEN cases of RP given out for instability!! ]
"AM CAUGHT. In this regard I'd rather not argue technicalities (it's DDoS attacks!), so I'll just go ahead and eat this humble pie. 
Apologies, this is actually a demonstrable case where we've given out RP for this specific type of server instability. On that note, giving out RP to our players has problems outside of cost (of which IP boosts actually have a monetary cost as well), so this isn't a simple case of greed.

Final note, and not meaning to be too jovial here, but I find it rather ironic that the one demonstrable case of giving out RP for server instability was... Europe. Sorry. HAD TO POINT IT OUT." - Riot Pwyff

[ What about the compensation? Can we get something other than IP boosts? ]
"Moving forward, compensation is a concept we're going to have to reevaluate. It's clear that IP boosts aren't good for everyone anymore. Also legacy skins aren't realistic for established platforms. This causes as much pain as it solves. 
Should we have service disruption in the future, we'd like to be able to offer an apology that's satisfactory to as many players as possible. We've heard your feedback loud and clear. 
All I can do is promise that we will be smarter about communication both internally and here with you guys when it comes to compensation.

We get you if you don't accept this apology, but we still hope to earn back your trust in the future." - Demorphic


[ When will we see our 20 win IP boosts? ]
"We've started running the script to add the 20 Win IP Boost to your accounts, but it might take up to 3 days to complete, just because of the large number of players on EUW. Thanks for your patience!" - Demorphic


[ I lost LP while playing ranked during all these problems, can you restore it? ]
"I couldn't find a direct post anywhere in here, but there are questions regarding LP loss for server instability. On that front, it's just so difficult to clearly prove a loss was due to server issues (and to decide where we apply a cut-off point) that retroactively compensating LP is nearly impossible for the entire player base. We do turn on loss forgiven mode as soon as we know issues arise, but there can be delays and we apologize for that." - Riot Pwyff

Odds and Ends

[ Will we see better forum features in the future? ]
"New forum features are on the way, can't tell you exactly when but I've seen some great new features running in test environments." - sonicdeathriot

[ Can you please stop releasing champions and instead work on the server stability? ]
"These are two different aspects of game service. Game content is being created by teams of developers, designers, artists, writers, animators, etc. The entire 'server thing' is driven by the Live Services department, which mostly consists of various kinds of specialists, from System Administrators, through Network, Security, Platform Engineers, to Database Administrators and Live Producers. Those two groups are independent and don't work on common projects. Well, we don't even sit together in the office as we're on different floors and in different buildings. Obviously our game is what spans them all, but one project can be run simultaneously with the others. We always say you don't want your artists doing brain surgery. Releasing a new champ does not block us from running and improving our live service." - Riot tmx

[ I'm still getting connection errors even though I've tried basically everything. Are you guys still working on this? ]
"Yes, we're still working on this issue and I'm sad to see some players still having connection errors. Our recent fix was based on network configuration changes and we know it finally worked for the majority of players. The connection error no longer happens for players who were facing this issue a week ago. If you still can't connect and it lasts for days, please contact EU Player Support and ask them to forward your e-mail address over to me. I'll work with you directly to get it resolved. We did it in the past with some other players." - Riot tmx

[ Why do login queues exist and why are they sometimes 30 minutes to an hour long? ]
"When we have a service-wide impacting technical issue with the live system, we utilize the log-in queue for a number of reasons: 
- to reduce load on the live service while we troubleshoot
- to protect the experience of our players who have already logged in
- to prevent a flood of players (i.e. hundreds of thousands) from overwhelming the live service
We have an internal initiative, "Nashor's Tooth", that's focused on optimizing our system to allow players to log-in and play as fast as possible.

We have a complex system with thousands of servers supporting millions of European players every day. We also continue to deliver new features and content as frequently as possible. As a result, we're making changes to both the software and hardware on a regular basis, which can introduce some risk to the stability of the live service. Our goal is to prevent 100% of the inevitable software and hardware failures from impacting our players. We already have many incidents every week that players never see (i.e. a server loses a hard drive, an ISP goes off-line), but we will not stop focusing on improving in this area." - ScottThruDaHeart
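
( Editor's note: To show why a throttled login queue can turn into a long wait, here is a minimal Python sketch. It is an illustration under my own assumptions, not Riot's implementation: the `LoginQueue` class and the 100 players per second admit rate are hypothetical. Players are let in first come, first served at a fixed rate, so the estimated wait is simply your queue position divided by that rate. )

```python
from collections import deque


class LoginQueue:
    """Toy first-come, first-served login queue with a fixed admit rate."""

    def __init__(self, admits_per_second: int):
        self.admits_per_second = admits_per_second   # hypothetical throttle
        self.waiting = deque()

    def join(self, player: str) -> int:
        """Add a player and return their 1-based position in the queue."""
        self.waiting.append(player)
        return len(self.waiting)

    def estimated_wait_seconds(self, position: int) -> float:
        """Estimate the wait for a given position at the current rate."""
        return position / self.admits_per_second

    def tick(self) -> list:
        """Admit up to one second's worth of players from the front."""
        count = min(self.admits_per_second, len(self.waiting))
        return [self.waiting.popleft() for _ in range(count)]


if __name__ == "__main__":
    queue = LoginQueue(admits_per_second=100)
    # Made-up scenario: 180,000 players hit the queue after an outage.
    wait = queue.estimated_wait_seconds(180_000)
    print(f"last player waits about {wait / 60:.0f} minutes")
```

With those invented numbers the last player in line waits around half an hour; raising the admit rate is the only thing that shortens the wait.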

[ Lag sucks, please fix it. ]
"Lag is a difficult and frustrating problem to solve. It can be caused by many different issues some in Riot's control and some not. Here is what we do today to ensure the parts we can control are designed to minimize lag: 
1. We ensure we do not overload game servers with too many players. 
2. We keep our Internet circuits capacity well below their maximum network flow rates. 
3. We test each release to identify any in game bugs that could create lag and hotfix issues as we find them. 
4. We monitor thousands of metrics in our data centers to ensure we have a healthy environment. 
5. We continue to engineer tools to detect lag and diagnose lag root causes so we can address them. 
There are a few things players can do to minimize the chance of lag: 
1. Playing through wireless connections increases the chance of running into issues that can create lag. If you have to play through wireless, try setting your wifi router to a different channel. Most wifi routers ship on the same wireless channel, so if you are around other people using wifi it can create interference that will affect the speed of your wifi connection from time to time. 
2. Ensure your PC's network is dedicated as much as possible to the game. If you are running other programs that use the network like music streaming, video streaming, file downloads, etc. these could be overwhelming your Internet connection for short periods of time.

3. Make sure you are actually seeing lag; a dropping video card framerate can look like lag. Check the FPS counter in the upper right corner of the game screen. If it drops below 20 you could see symptoms that look like lag but are actually related to something overloading your PC's CPU, memory, or disk resources. Shut down other things running in the background and if you are constantly seeing low FPS numbers then you might want to consider upgrading your video card or other PC resources." - sonicdeathriot
[ I play in Kenya and get really bad lag. Any suggestions? ]
"We don't have a service close to Kenya, so your ping time is going to be higher because you are so far away from the game servers located in Frankfurt. The Internet connects the world in complicated ways so you might try our new Oceania service when it opens this Summer and see if ping time from Africa is better." - sonicdeathriot
( Editor's note: Bolded something important; This should make you AUS players very happy. )

[ What happened with the custom item sets? ]
"Soon after the 3.7 release, we started to see potentially similar service disruption issues globally. We quickly ascertained that the root of the issue was network related and wanted to quickly nullify any variables that may be causing the issue. As a result, we decided to revert the 3.7 client to remove the possibility that we were in fact creating a self-inflicted problem. In the roll back to an older client, we removed Custom Item Sets. This feature will be returning in an upcoming release." - Riot Banksy
