Today Scott Guthrie announced the public preview of Read-Access Geo-Redundant Storage. Before we start talking about failover, let’s take a quick look at the redundancy options you can choose from when you create a Storage Account:
- Locally Redundant: your data is stored in three synchronous replicas within a region. This is the cheapest option.
- Geo Redundant: the same as Locally Redundant, but your data is also replicated asynchronously to three replicas in a secondary location. This means that if your data is stored in East US, it will be replicated to the West US datacenter and stored there an additional three times.
Until now, the data stored in the secondary location was only accessible after contacting support, but with the preview of Read-Access Geo-Redundant Storage, or simply RA-GRS (simply?), you now get read-only access to this replicated storage account.
“Real” Failover
Traffic Manager has been available for a while now. But if you’ve ever really used it, you’ve probably been thinking, “what about my data?” For the Windows Azure SQL Database we have the Data Sync functionality (does this thing really work?), but up until now there was no out-of-the-box solution to make your data available in a different datacenter.
Let’s make a simple application that takes advantage of RA-GRS and the Traffic Manager.
The Wall
So I’ve built this little social application (this must be social-app #298439292) which allows you to post messages on someone’s wall:
This is a simple application which runs in the West-Europe datacenter (http://wallapp.cloudapp.net/) and uses Table Storage to store the messages on the wall. These are stored in the wallprod Storage Account, also located in the West-Europe datacenter. So if the West-Europe datacenter starts having issues with Compute or Storage, my application will go down.
Traffic Manager
The first thing we’ll want to do is create a new Cloud Service in a secondary location. Since my application is deployed in the West-Europe datacenter I’ll deploy the “spare” application to the North-Europe datacenter.
Now that my spare Cloud Service is online I’ll configure Traffic Manager. I created a new profile called “wallapp” in which I added my Cloud Services as endpoints.
As you can see, I’ve set my profile to Failover mode and the wallapp.cloudapp.net Cloud Service is the first endpoint in the priority list. As soon as this Cloud Service goes down, Traffic Manager will kick in and the wallapp.trafficmanager.net endpoint will point to wallapp-failover.cloudapp.net. Note that in a real scenario I’d have something like www.thewall.com pointing to wallapp.trafficmanager.net.
In order to test the failover I’ll simply stop the wallapp.cloudapp.net Cloud Service. Once Traffic Manager notices that the application is offline, it can take up to 30 seconds (the DNS TTL) before I’m directed to the failover environment. I can test this by navigating to http://wallapp.trafficmanager.net/ or by pinging wallapp.trafficmanager.net (even if PING isn’t enabled and the ping itself gets no reply, you’ll see that wallapp.trafficmanager.net resolves to wallapp-failover.cloudapp.net):
Now if I visit my wall, I see the following message:
Let’s see why we’re doing this…
What about my data?
OK, so what are your options when it comes to failover?
- No failover, the easiest and cheapest option.
- Making reads and writes available from multiple locations (spanning different scale units). This means you’ll have more than one master and you’ll be responsible for keeping the data consistent between all replicas. The best option, but probably also the most expensive one.
- Provide a “degraded” version of your application or specific features in your application. This is what we’ll be doing.
Using the Traffic Manager, our application is able to fail over to a secondary location. But if there’s an issue with Storage, our application will still break. That’s why it’s useful to also enable RA-GRS on our storage account. Since this is a preview feature we’ll need to activate it first: http://www.windowsazure.com/en-us/services/preview/
After the preview feature is active you can enable RA-GRS on your account:
As you can see the Secondary Region for my Storage Account is North-Europe. Connecting to the Read-Access Storage Account in the Secondary Region works by convention. Just add -secondary to the name of your Storage Account and use the same keys to connect to the Storage Account. In my case I’ll be connecting to wallprod-secondary.table.core.windows.net
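The naming convention is easy to capture in a small helper. This is just a sketch: the SecondaryEndpoint class and its ForService method are hypothetical names I made up for illustration (they are not part of the storage client library), and the code assumes the public core.windows.net endpoint suffix and HTTPS.

```csharp
using System;

// Hypothetical helper (not part of any Azure SDK) that applies the
// "-secondary" naming convention described above.
public static class SecondaryEndpoint
{
    // service is expected to be "blob", "table" or "queue".
    // Assumption: the account lives in the public cloud (core.windows.net).
    public static Uri ForService(string accountName, string service)
    {
        if (string.IsNullOrWhiteSpace(accountName))
            throw new ArgumentException("Account name is required.", "accountName");
        if (string.IsNullOrWhiteSpace(service))
            throw new ArgumentException("Service name is required.", "service");

        // <account>-secondary.<service>.core.windows.net
        return new Uri(string.Format(
            "https://{0}-secondary.{1}.core.windows.net",
            accountName, service));
    }
}
```

Calling SecondaryEndpoint.ForService("wallprod", "table") yields a URI whose host is wallprod-secondary.table.core.windows.net; the account keys stay the same as for the primary endpoint.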
Configuration
Since I’m working with a Cloud Service I can take advantage of different Service Configurations to configure how the failover version of the application should work. Start by right-clicking your Cloud Service, choose Manage Configurations and make a copy of the Cloud configuration (I called it CloudFailover). Here’s what I changed in that copy:
- I made sure the storage account points to the secondary Read-Access version.
- I added a property called IsFailover and set its value to 1. This allows me to disable certain features or show warning messages.
- I changed the Diagnostics Storage Account for my failover configuration to use a Storage Account in North-Europe (you wouldn’t want diagnostics to write to the primary location, which might be broken).
This means that the failover deployment will connect to the read-only Storage Account. It could be possible that some data is missing (maybe the replication wasn’t complete before Storage in West-Europe went down), but at least my users will still have access to the application (even though some features might not completely work).
Now since I have a setting which defines if an application is deployed in a failover environment or not I can access this setting in my web application and use it to show notifications or to limit access to specific features. This is how I’m showing the notification that posting a message is not possible:
    @if (ViewBag.IsFailover)
    {
        <div class="alert alert-danger">We're having some technical issues. Until we solve the issue, you won't be able to write new posts.</div>
    }
    <div class="jumbotron">
        <h1>@Model.Username's Wall</h1>
        @if (!Model.Messages.Any())
        {
            <p>There are no messages. Hurry up and post a message!</p>
        }
    </div>
Or how we’re restricting access to the “Post a message” feature:
    using System.Web.Mvc;
    using Microsoft.WindowsAzure.ServiceRuntime;

    public class MessagesController : Controller
    {
        // Runs before every action: expose the IsFailover setting to views and actions.
        protected override void OnActionExecuting(ActionExecutingContext filterContext)
        {
            ViewBag.IsFailover = RoleEnvironment.GetConfigurationSettingValue("IsFailover") == "1";
            base.OnActionExecuting(filterContext);
        }

        public ActionResult Index(string username)
        {
            return View(new MessagesModel()
            {
                Username = username,
                Messages = MessageService.List(username)
            });
        }

        [HttpPost]
        public ActionResult Post(PostMessageModel model)
        {
            // The secondary storage endpoint is read-only, so refuse writes in failover mode.
            if (ViewBag.IsFailover)
                return View("ReadOnly");

            MessageService.Add(model.Username, model.Subject, model.Body);
            return RedirectToAction("Index", new { username = model.Username });
        }
    }
And that’s it. Now I can deploy my application to the failover environment with a specific Service Configuration where the IsFailover option is set to 1. This will cause the application to run in degraded mode (showing messages and limiting certain features).
Considerations
When the Traffic Manager monitors your application it will connect to your homepage by default. But if your homepage doesn’t use storage (or Service Bus or whatever…), Traffic Manager might think everything is OK while it’s not. That’s why it’s important to have a custom health probe (like /HealthCheck.aspx) which also checks if the services you depend on are working correctly.
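To give an idea of what such a probe could do behind the scenes, here is a minimal sketch. The HealthCheck class is a hypothetical helper of my own, not something from the Azure SDK, and the individual probes (a cheap storage read, a Service Bus call, …) are left for you to plug in. It assumes Traffic Manager treats an HTTP 200 response as healthy and anything else as unhealthy.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: runs a set of dependency probes and maps the
// outcome to the HTTP status code a health check page would return.
public class HealthCheck
{
    private readonly Dictionary<string, Func<bool>> probes =
        new Dictionary<string, Func<bool>>();

    // Register a named probe, e.g. a delegate that does a cheap storage read.
    public void Add(string name, Func<bool> probe)
    {
        probes[name] = probe;
    }

    // Returns the names of the dependencies whose probe returned false or threw.
    public List<string> FailedDependencies()
    {
        var failed = new List<string>();
        foreach (var pair in probes)
        {
            bool ok;
            try { ok = pair.Value(); }
            catch (Exception) { ok = false; } // a throwing probe counts as a failure
            if (!ok) failed.Add(pair.Key);
        }
        return failed;
    }

    // 200 when every dependency responds, 503 as soon as one of them fails.
    public int StatusCode()
    {
        return FailedDependencies().Count == 0 ? 200 : 503;
    }
}
```

The page behind /HealthCheck.aspx would then register one probe per dependency and set Response.StatusCode to the value returned by StatusCode(), so a broken dependency makes the endpoint look unhealthy even when the homepage still renders.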
When you build an application, keep in mind that systems will fail eventually, so you’d better come prepared. Read-Access Geo-Redundant Storage and the Traffic Manager make this a lot easier to do.
More information: