Joyent Weblog
Bingodisk and Strongspace: What Happened?
We have had a fantastic beginning to the year at Joyent. Our revenue continues to grow quickly. We have been gaining new Accelerator customers at a record pace. The Facebook deal is dramatically increasing the size of the Joyent community and is already paying off handsomely as successful start-ups upgrade their Accelerators to serving millions of pages to millions of users. The biggest Facebook application running on Joyent Accelerators now serves over 700 million pages per month. Yes, 700 million.
While the commercial grade Accelerator products have been growing faster than ever, two of our smaller, “prosumer” products hit a serious road-bump.
Ten days of downtime
The past 10 days have not been the best days at Joyent. Bingodisk and Strongspace went off-line on Saturday, 12 January. Bingodisk service was restored eight days later on 19 January. Strongspace limped back into service late on 21 January, nearly ten days after it went off-line. Customers of these services are rightly outraged by the outage. While Strongspace and Bingodisk represent a very small fraction of Joyent’s entire infrastructure, we understand how critical they are to many of you, and have been working and investing many, many hours to bring these services back on-line as expeditiously as possible. I apologize for the outages.
In this post I would like to report on what happened, how Joyent plans to compensate our customers, and what we plan to do in the future with Strongspace and Bingodisk.
Some Background: the Economics of Bingodisk and Strongspace
Strongspace was introduced in August, 2005 as an elegant multi-user storage solution using SFTP. It initially was deployed on EMC Clarion storage. The market for on-line storage was rapidly crowding and the price of on-line storage quickly dropped. We began the process of looking for a new architecture and hardware platform in order to remain competitive. With the Zettabyte File System (“ZFS”) in OpenSolaris and the introduction of the Sunfire X4500 (aka “Thumper”), we realized that we could build very competitive on-line storage solutions at costs that kept us more than competitive. Strongspace moved to ZFS in December, 2005 and onto a Thumper in October, 2006. We came out with Bingodisk, also based on ZFS and the Thumper, in September, 2006. Without ZFS and the Thumper, we probably would not have been able to continue Strongspace or introduce Bingodisk. The Thumper provided the raw storage-to-controller ratios, and ZFS the redundancy and data protection, that we required without having to spend, literally, hundreds of thousands of dollars.
I’m laying out this background to note that both Strongspace and Bingodisk were always designed to be (a) inexpensive, utility storage in the cloud, and (b) built on top of a filesystem and hardware platform that would ensure we would not lose data.
OK, enough preamble. What happened?
On 12 January, both Strongspace and Bingodisk went down because ZFS encountered what it thought was duplicate data on disk. The so-called “spacemap bug” (fixed in build 60 of OpenSolaris) apparently double-writes blocks. The problem arises when ZFS later realizes this and tries to free that which is supposedly already free. ZFS thinks that something went wrong and that it may corrupt data so it (correctly) panics. Once this loop gets going, it’s tough to break out of it.
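To make the failure mode concrete, here is a toy sketch in Python of a free-space tracker that refuses to free a block twice. It is only an illustration of the "free something that is already free, then stop" behavior described above, not ZFS code; the class and method names (`ToySpaceMap`, `allocate`, `release`, `DoubleFreeError`) are made up for this example.

```python
class DoubleFreeError(RuntimeError):
    """Raised when a block that is already free is freed again."""


class ToySpaceMap:
    """Tracks which block numbers are free. Hypothetical, not ZFS code."""

    def __init__(self, total_blocks):
        # Every block starts out free.
        self.free = set(range(total_blocks))

    def allocate(self):
        # Hand out the lowest-numbered free block.
        block = min(self.free)
        self.free.remove(block)
        return block

    def release(self, block):
        # Freeing a block that is already free means the bookkeeping is
        # inconsistent, so stop rather than risk corrupting data.
        if block in self.free:
            raise DoubleFreeError(f"block {block} is already free")
        self.free.add(block)


space_map = ToySpaceMap(total_blocks=4)
block = space_map.allocate()
space_map.release(block)

try:
    space_map.release(block)  # second free of the same block
    panicked = False
except DoubleFreeError:
    panicked = True
```

In this sketch the second `release` raises instead of quietly carrying on, which is the conservative choice: a storage system that keeps running with inconsistent free-space accounting could hand the same block to two owners.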
We were conservative and methodical in how we moved forward from here.
We updated to the latest Solaris build and ZFS code so that issues from the bug wouldn’t occur again, and then set out to find and repair the problem areas while getting the services separated and running. The operating system updates went fine. The dataset imports and updates took quite a deal of time. One of the larger distractions early on in the process was a bug in the NCQ driver that made the SATA drives appear to have “issues”. We corrected the NCQ driver. We also performed a complete hardware swap-out (just to be safe). Every piece of hardware in the original Thumper was replaced with parts from a standby Thumper. In the end all the drives from the original Thumper (48 of them), ended up in a new Thumper. We had to do all this so that we could safely read the data off the original Thumper to bring the services back up on new Thumpers.
When it became clear that the data set under Bingodisk was fine and likely not where the issue lay, we moved all that data to new storage. We didn’t trust a block-level restore, so we had to read and write files, and writing that much metadata takes an enormous amount of time: about 1TB every 10 hours. We were able to get Bingodisk operational first.
Strongspace took longer because that was the dataset with the problematic area(s), areas that took about 5-10 hours to expose themselves in testing each time. We are currently running ZFS for Strongspace in a state where it won’t panic when it hits the problem area, but will instead run a recovery.
We’re quite fortunate these problems happened to us with ZFS. ZFS at the very least gave us the confidence that our data is there and valid. No data was lost.
Some have wondered why we didn’t upgrade the operating system earlier. Upgrading the operating system is not a trivial task on a production system with so much storage in play. Further, the version of ZFS we were running on Strongspace and Bingodisk was more mature than the code originally shipped in Solaris 10. This meant the code we had in production had gone through ZFS’s vaunted test bed. Finally, the likely result of an operating system upgrade would have been to expose the “spacemap” data errors on disk sooner, bringing down the services nonetheless. Once bitten…
Was there a backup?
Yes, and no. In the traditional sense of us writing the data from Bingodisk and Strongspace to tape or some other Thumper, no, there was no backup. Data redundancy is built into the ZFS/Thumper software/hardware combination. The Thumper is both server and backup. Moreover, it’s hard to see how a backup of 18TB of data to another physical device would work in practice. Moving Bingodisk to another Thumper during this crisis took 30 hours (3TB of data). The amount of data managed by a large, multi-tenant service such as Bingodisk or Strongspace makes it practically impossible to do a meaningful backup. A single backup would take over a week. The backup process would kill end-user performance. A service like Strongspace, which people use to rsync their own backups, means the data turns over rapidly and an incremental backup would not make sense. ZFS has a facility, zfs send/receive, that runs on an idle thread. There is currently no way of giving priority to this functionality, so, again, practically speaking, this could not be used for backup.
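The "over a week" figure follows from the copy rate we saw during the recovery. As a rough back-of-the-envelope sketch in Python, assuming the file-level rate of about 1TB every 10 hours holds for the whole data set (the function name `copy_hours` is made up for this example):

```python
# Observed file-level copy rate during the recovery: about 1TB every 10 hours.
HOURS_PER_TB = 10


def copy_hours(terabytes, hours_per_tb=HOURS_PER_TB):
    """Estimate hours needed to copy a data set file by file."""
    return terabytes * hours_per_tb


bingodisk_hours = copy_hours(3)       # the 3TB Bingodisk move
full_backup_hours = copy_hours(18)    # a full 18TB copy to another device
full_backup_days = full_backup_hours / 24

print(bingodisk_hours, full_backup_hours, full_backup_days)
```

At that rate the 3TB Bingodisk move comes out at 30 hours, matching what we saw, and a full 18TB copy comes out at 180 hours, or 7.5 days. A faster transport or a block-level copy would change the rate, so treat this as an estimate under the stated assumption rather than a hard limit.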
Joyent Accelerators and Connector are backed up for disaster recovery daily. The datasets for each of these is much smaller and therefore fit into a practical backup scheme.
So, Bingodisk and Strongspace were backed up based on the redundancy built into the Thumper itself and the capabilities of ZFS. Fully 6TB of storage on a Thumper is dedicated to redundancy. ZFS’s capabilities to ensure no data loss were proven in this instance. These Thumpers sit in a telco-level data center (the best) that is rated to withstand a magnitude 9 earthquake. The fire systems in the data center itself mean the chances of the Thumper being lost to fire are statistically meaningless.
What is Joyent going to do for customers?
With the events of the past ten days, we’ve been doing some hard thinking about Strongspace and Bingodisk.
Here’s our plan for Strongspace. We’re not taking any more sign-ups for Strongspace. The current Strongspace will be replaced by a new service (not named Strongspace) that will not have the economic model of the current service. It will be expensive, distributed and bullet-proof. The replacement service will likely be introduced before October, 2008. We will retire the current Strongspace on 1 October 2008. There is Strongspace functionality in Joyent Connector today, and that will remain. Customers currently on Strongspace will be allowed to continue to use the service for the next 9 months for free. If you bought Connector for Strongspace and only want Strongspace, please file a ticket. You’ll be allowed to remain on Strongspace for 9 months for free, but your Connector and Shared Hosting (or Shared Accelerator) accounts will be deleted. If you bought Connector for Strongspace and you want to keep your Connector and Shared Hosting (or Shared Accelerator) accounts, please file a ticket and you will get a coupon for 4 months of the new service for free. If you are a Mixed Grill (or similar) customer, we will be replacing the Strongspace component with the replacement product. Every current Strongspace customer will get a coupon for 2 free months (minimum) of the new service. If you just feel like saying “screw it, I don’t want to have anything to do with these guys”, please file a ticket and we will refund you for two weeks of down time.
We will be open-sourcing the current Strongspace. This will allow anyone to run Strongspace private label on any infrastructure provider they choose. After Connector, Slingshot, and our DTrace probes for Ruby, this is Joyent’s fourth major contribution to open source. We will continue to provide some infrastructure for the FreeStrongspace community and a test bed for installations and demos.
Bingodisk is used widely by people preferring HTTP over proprietary APIs to serve up static assets for web sites. Due to the downtime, we are giving Bingodisk customers four months free. In fact, anyone signing up for Bingodisk between now and March 1st will not be charged for two months. If you feel you don’t want to have anything to do with Joyent, please file a ticket and we will refund your annual subscription, pro-rated, plus an additional two weeks. Bingodisk sign-ups are currently disabled, but we’ll be bringing that process back on-line this week. Bingodisk continues to have the same economic model of inexpensive storage and an industrial strength filesystem for data security. Over time it will be folded into Connector.
Bingodisk will also be open-sourced. Anyone will be able to run Bingodisk on any infrastructure provider they choose. This is our fifth major contribution to open source. As with Strongspace, we will continue to support the FreeBingodisk community through providing infrastructure and a test bed for installations and demos.
While these measures do not get back the eight and ten days of down time, I hope they do send the message that we value all of our customers. Again, I apologize for the down time.
———-
EDITED: Added end date on ’2 months for free’ promotion for new BingoDisk signups.
Commenting is closed for this article.
Thanks very much for your openness. I appreciate your enthusiasm on setting things right in this matter even though I think a few things could have been handled differently and much earlier here.
One question though: if I understand correctly you plan to migrate current Mixed Grill customers to the new SS-replacement product. In other words, there is not going to be any form of specific compensation for MG customers except for the plan to continue the service albeit in different form?
Please correct me if I am wrong.
— Bijan Kafi 110 days ago
Thank you for this post. I did not envy you your week! I had only just begun using the SS component of my 3ML account, but it had already become an integral component of my own business. I appreciate that you’ll be giving customers like me a place in your replacement (“expensive, distributed and bullet proof”) product. My investment in lifetime accounts here continues to be among the best decisions I have made.
— Eric Wagoner 110 days ago
@Bijan: that's correct. We’re going to significantly beef up the service.
— David Young 110 days ago
I have a 3GB non-expiring lifetime account on Strongspace. What will happen to it?
— Neville 110 days ago
@Neville: you will be moved to Strongspace’s successor.
— David Young 110 days ago
I started with just a Strongspace account, and I came into the Connector/Webhost/Strongspace combo when the Strongspace service was repriced and upgraded. Will I be left hanging in this transition?
— James Lindeman 110 days ago
As a user who uses SS as a mirror of local files (with ssh+rsync), I consider the loss of SS bad news. Reliability is good, to be sure, but I don’t need Yet Another Redundant Mirror to be bulletproof. That’s why I use redundant strategies in the first place. Will I be able to access the SS functionality in Connector using the ssh+rsync mechanism? If so, no problem here.
— Ryan 110 days ago
@James: no. You will be upgraded to the enhanced service. BTW: once we complete open-sourcing Strongspace, you’d be free to run the service yourself anywhere you like. For a single customer, a Joyent Accelerator might make a superb home.
— David Young 110 days ago
It would be appreciated if you would keep your customers actively informed via the existing RSS status feeds. Several days between updates during a 10-day SS outage is really not acceptable.
That kind of silence would get any of us fired from our day jobs.
— Mike Linnane 110 days ago
Hi David,
I feel for you guys. I know this has been a rough time for the team. Thanks for being so open.
It’s not totally clear, at least to me, what will happen to those of us who have lifetime plans, as part of Mixed Grill, upgrades, etc. For example, from the SS dashboard:
“ Your Plan
You’re currently on the Premier Prepaid Plan, which means we won’t be charging you anything, ever!”
Will the ‘Lifers’ be paying some premium annual rates to keep what we now have?
— Michael Larocque 110 days ago
@Michael: no. You won’t pay more. You’ll get the new service, when it is available.
— David Young 110 days ago
Thanks for being so open and I wish you the best of luck, but that was just too scary for me: I’m out of here…
— Pete 110 days ago
@David it’s great to hear that SS and BD are being open sourced. Jason mentioned it the other day, and it’s definitely a great way for people to see that Joyent are serious about opening up their technology and allowing customers to build on it. Bravo.
— Jacques Marneweck 110 days ago
Thanks for the update.
I have a question regarding strongspace in Connector, and rolling BingoDisk into Connector:
1. Will the Connector Strongspace interface be improved? The Strongspace web interface is far superior to the Connector one for file management. I also find Strongspace far and away faster than Connector. Connector is always sluggish for me.
2. Does the comment regarding BingoDisk imply that all those on Connector will in effect get a BingoDisk, similar to the built-in Strongspace?
How do the built-in Strongspace and BingoDisk (in the future) interact with the Accelerator diskspace? Is it extra space on top of your Accelerator or does it eat into the Accelerator?
Thanks again for the explanation.
— Ryan 110 days ago
I think it’s important to point out that the answer to the headline question, “Was there a Backup?” isn’t yes and no. It’s no. A backup is a copy of the data living on a different physical drive. RAIDs and tough filesystems are not backups and every system admin in the world knows that.
I feel like this entire post and all the status updates are really misleading and dishonest. Explain to me how the corruption got into the backups when there simply weren’t any?
To claim that backing up the data would be too expensive is silly, that’s something for you to work out. If the product wasn’t financially viable, then you should have shut it down with notice instead of hoping nothing really bad happened. Or better yet, just not build it. Who builds a cloud data storage product without planning to back it up?
And for the argument that you can’t back up 18TB of data, well that’s just not true. I do it and so do lots of other companies. A redundant Thumper, some fibre and a few scripts and you’re set to go. 30 hours to move 3TB of data screams out ethernet to me.
I think you guys all got really lucky this week, whether you think so or not. This could have easily been a disaster with lost data.
And what’s your solution going forward? Canceling and not supporting the product. Nice to know that Joyent will cancel a product rather than fix it to live up to people’s expectations.
I think Joyent needs to put out a document that describes in detail all the data protection and recovery policies for every product. How can we trust you that the Accelerators are backed up if you’ve been saying that Strongspace has been?
The best part? I got my bill for strongspace this morning.
— tom 110 days ago
All sounds good to me. Thanks, guys. One for the suggestion queue: add encryption to SuperSpace (ok, I just made that up) and have something really unique.
— Geoff Cheshire 110 days ago
Just want to agree: In terms of functionality, ease-of-use, beginner users “just getting it,” and so on, Strongspace has a great, great interface, one that the Connector could learn and benefit from.
Don’t get me wrong…I think Connector’s interface is unique and interesting…it just seems odd and ‘off’ when you get to the ‘files’ portion…even down to the level of how folders/directories and files are reported at the top of the right pane. In short, it lists directories as files…and then doesn’t show them underneath, so you get ’6 files’ listed, you look down and none of those ‘files’ appear. Strongspace has none of these problems. And the ability to have a guest user ‘jailed’ to a certain directory is just wonderful.
Wonderful of course that you’re open-sourcing stuff. Just want to make sure you appreciate what you have when you start assembling (yet another?) mystery super future product.
— jcburns 110 days ago
That is great, just one little detail bugs me.
You said I’ll get 2 months free Bingodisk, then you go and say EVERYONE signing up will get these 2 months… so in the end, me, the already existing user who suffered with 10 days of offline time, will in fact get nothing back for the downtime.
I’m glad to see the eagerness to reply to us, but it seems I’m not getting what I deserve…
— Rafael 110 days ago
@Ryan: we will continue to improve Connector, including its performance. We will be adding WebDAV/HTTP support to Connector files, as well. There is no storage relationship between Connector and Accelerator, currently. But we have plans to change that.
— David Young 110 days ago
@tom: we’ll have to agree to disagree. My fundamental point is a 15TB service such as Strongspace can’t be backed up to another device. It would take 150 hours just to copy the data. So, we designed a service that combined highly redundant hardware with a very fault-tolerant file system. For a very competitive price. We’ve decided we don’t want to do the competitive price. Instead we’re going to develop something that is redundant, fault-tolerant, always up, expensive. While we suffered downtime, we didn’t lose data. So the design worked.
The billing system for Strongspace is tightly coupled with the service itself. When the service came back on-line, people got billed. We’re proactively refunding those people, if they were billed from the Strongspace service (i.e. Strongspace customers before Strongspace was bundled with Connector).
— David Young 110 days ago
@Rafael: you’re right. We’ve upped the free months for existing Bingodisk customers to 4 months.
— David Young 110 days ago
What continues to upset me about this whole mess, which still hasn’t been addressed by this post, is how irresponsible Joyent was in the first place for allowing these services to be brought down by a known filesystem bug that’s been fixed since, what, March? 10 months ago?
It’s one thing to base two “prosumer” level products on a newfangled unproven filesystem like ZFS, but it’s arrogant to the extreme to then refuse to ever perform necessary upgrades on said newfangled technology. With no real backups.
From out here, it looks like Joyent got caught up once again in the latest shiny new toys (Facebook, Accelerators), and once again left the customers of their boring old toys in the lurch. If I was a customer on one of the hot new Joyent products, I’d take a look at what’s happened here and wonder if massive preventable outages and canceled services are in store for those, too.
— Aja 110 days ago
@David: will it be possible to manage Connector files with rsync+ssh?
— Ryan 110 days ago
@David: Yes backing up 15TB of data at once will take a long time, but strongspace doesn’t create 15TB of data a day, that’s what incremental updates are for.
If you used 4Gb Fibre, trunked, you can cut that time down fast.
And if you think having 10 days of downtime “worked” then you and Joyent have no clue what it means to run a web service or a data business.
Anyway, if Joyent had just been honest and admitted that you guys made mistakes, that the system was built out in a way that made it hard to recover, you would have possibly saved this customer. But the arrogance and refusal to admit that Joyent messed up and just canceling the product while touting your success is unacceptable.
Seriously, who starts off a mea culpa with bragging?
Nowhere in that post does the word Sorry appear, btw.
You’ve lost 3 strongspace accounts, a few accelerators and a customer of over 2 years today.
— Tom 110 days ago
Like many others around here, I have been lurking for the last week waiting for both services to come back up and for events to unfold.
While I have encountered several glitches with Bingo and StrongSpace, I had hopes that one of them might fulfill my needs at some stage. I did not place all my eggs in one basket, though, mostly due to these glitches and now I am glad I made this decision. Still, in the middle of this turmoil, I renewed my subscription for Bingo on the 19th of this month.
David, your explanation is, in my opinion, lacking in several aspects if not outright disappointing. You make a clear line between your “commercial grade Accelerator products” and your other services for “prosumers” and cheapskates alike.
It now appears that the services we were placing our trust on were not “commercial grade”. A few months ago these were sold as a secure, reliable service and now they appear to be “inexpensive”, cost “competitive” solutions for “prosumers” who were apparently served a single point of failure to make ends meet. It kind of sounds as if we “prosumers” were given what we deserved. Not good.
I will not comment upon this last matter, the fact that no backup plan was apparently in place. Maybe you had good technical reasons to back this choice but I believe that Tom above managed to nail it better than I would ever be able to.
Frankly, I was about ready to keep using Bingo for a while together with another “commercial grade” service from a third party but I will have to reconsider this solution, especially in light of the fact that we cannot be sure that Bingo will be around next year or that this Thumper thingy won’t collapse again in the rack under its own weight.
I sincerely wish you manage to work around this mess, but quite frankly, I feel extremely disappointed by this explanation.
— mamorim 110 days ago
@Aja: thanks for your input. Operating system upgrades are not trivial. The version of ZFS that we were running was more mature than that in Solaris 10. We were waiting for additional issues to be fixed in ZFS before upgrading the Thumper. It’s just not prudent to always be upgrading production systems.
— David Young 110 days ago
@Ryan: you can rsync and SFTP to the Strongspace folder in Connector today.
— David Young 110 days ago
@Tom: I’m sorry we’re losing your business.
I did say the words “I apologize” twice in the post. I wasn’t meaning to be arrogant at the beginning of the post. Only wanted to point out that Joyent wasn’t offline. In fact, we do know how to scale web applications.
Strongspace and Bingodisk were down for the reasons I’ve stated. We did not lose data. That is a big success. Fibre would have changed nothing. We chose to do the restores at a file level. The most conservative route. And the slowest due to the requirement to build all the file metadata. Incremental backups were/are not an option for such a large data set. There is an enormous amount of data turnover on Strongspace. And the file metadata is also prohibitive.
— David Young 110 days ago
@mamorim: from the home page of Bingodisk:
From the FAQs for Strongspace:
I don’t think anything was hidden from our customers.
I used the word “prosumer” to imply a product that was good enough for professionals, with easy to access methods (HTTP, SFTP) suitable for consumers.
— David Young 110 days ago
David, the team and all lifers,
Thank you for the update. It’s been a real hardship for all of us, including you guys, I know. As a Joyent lifer, I would encourage all of us lifers to come along side the Joyent team during these difficulties. Thanks to everyone who’s kept this positive, even the ones who’ve shared frustration but remained helpful.
I’m hopeful that somewhere down the road I’ll have a rockstar hosting/web service that will never retire or suddenly expire. In the meantime, open-sourcing and open communication brings a bit of comfort to a hellish 2 weeks. Read the post again.. I think some of you are skipping the apologies and explanations out of fatigue.
Thanks for enduring the labor… EVERYONE.
— Paul Ingram 110 days ago
Hi David and team!
Thanks for the thorough explanation. I feel with you guys when running into problems to explain the workings of a ZFS based RAID system to laypeople.
I have some really important stuff stored on my Strongspace account and I never feared for my data. The downtime was the only thing annoying to me.
I hope you won’t lose too many customers and people will take a moment trying to understand how their data really was backed up all the time. But maybe some other companies have magical ways to keep data even more secure.
- Sebastian
— Sebastian N. 110 days ago
Merlin Mann on Twitter
— Geoff Cheshire 110 days ago
I appreciate the detailed explanation and frequent updates.
I’m a VC3 customer (from Sep 2005) but I pay for 4 GB of Strongspace every month. So to clarify, I’ll be getting the next nine months free, after which time I can pay more to upgrade to the new service, or I have to find an alternative. Is this correct?
— John Topley 110 days ago
Just so there’s no confusion, there are two Ryans posting. I’m the Connector/Strongspace/BingoDisk interface and diskspace one.
@David.
Thanks for the reply. However, the Connector interface for file management is just not suitable. It was ok as a quick file dump from emails, but not as a real file system browser. It just isn’t. It can’t be tweaked to be good. It’s just not suitable for file system management. Strongspace’s web browser is. If you change Connector to contain a real file system browser like Strongspace then that would be fine. But as it stands it’s not a suitable substitute. It just feels unintuitive and clumsy as a file browser.
Having said that, merging the services into Connector is painful anyway. These things are best used as separate services, not as afterthoughts in a webmail interface.
re: diskspace, are you planning to merge our Connectors and Accelerators, effectively making the Accelerators bigger?
That leads to the question: What exactly is “Strongspace” inside the connector? If we end up with just a large pool of diskspace in an Accelerator doesn’t “Strongspace” and “BingoDisk” lose all meaning? Do they just become the labels of open source web interfaces? What would installing Strongspace or bingoDisk actually mean? Just an interface to a chunk of space on your accelerator? Why would you do that, if it’s all just space on the same Accelerator account? I haven’t used BingoDisk so I don’t know what that’s like, but if you were to install Strongspace on an accelerator, what would that mean? Just that you’ve added a neat web file browser?
Lots of questions. I’m just blurting.
Cheers,
— Ryan 110 days ago
Ryan
Also, I appreciate that the stand alone “Strongerspace” will replace the old “Strongspace”. Just after clarification on what exactly Strongspace/BingoDisk inside Connector actually is and means, and if they’re then merged further into Accelerators, what that actually means if it’s all just space on the same Accelerator.
Cheers
— Ryan 110 days ago
@David: I get that you guys choose not to do incremental backups, but you need to stop saying that it can’t be done. Lots and lots of people are doing incremental backups of data sets much larger than yours.
Plus you’re admitting it’s possible by saying you’re in fact building such a system, just for more money, in October.
There are other options as well, you could have one server accessing two storage pools in two different enclosures and mirror the data to both on write.
It’s been done.
My comment about Fibre was simply meant to show that you can move a lot of data very fast for backups. If your data set is spread across 48 spindles, your read times are blazing fast. The transmission interface becomes the bottleneck, not the size of your data set or the metadata.
A backup would have changed everything.
I get that you guys made trade offs with regards to data security, time and costs. All system admins do the same thing every day. But refusing to admit it was a trade off and that you had other options is the issue at hand.
I don’t mind problems, I mind fake transparency, which is why I haven’t said anything during the entire downtime. You still haven’t addressed your remarks regarding the corruption getting into your backups, when in fact you had parity information, not a backup. And you know, and everyone who works at Joyent knows, that is not the same thing.
You also really shouldn’t argue that using ZFS protects under all scenarios, and you can’t lose data. It’s just wrong. It’s safer than most filesystems, but things can happen, you CAN lose data. There’s no such thing as the perfect RAID, the perfect FS or the perfect hardware. Multiple things can break at the same time, it happens. Some of us have been unlucky enough to see it happen.
— Tom 110 days ago
@John Topley: if you bought Strongspace before we bundled it with Connector, you’ll get the next 9 months free and a coupon for 2 free months on the new service.
— David Young 110 days ago
@Ryan Interface: the goal is to provide Connector with SFTP and HTTP interfaces to its file store. Re: your issues with Connector’s UI for file browsing. I hear you. The Connector UI is not a webmail interface, it was built to be a generalized data item browser and manipulator interface. The UI can handle email, calendars, files, lists, bookmarks, contacts all in a consistent UI.
— David Young 110 days ago
Is the plan to e-mail the affected customers regarding the retirement of Strongspace?
— Tyler Ritchie 110 days ago
@Tom: you’re right, we can do a product that does mirroring, etc.; but it would be 7-10X the current cost per GB. I’ve posted in an earlier comment text from both the Bingodisk and Strongspace information pages disclosing that these services were running on a single Thumper. That’s what got us the price point we were trying to achieve.
I stand by my assertion that a device to device backup would not have had much impact in this case. In order to be very conservative, which we were, and I mention this in the post, we did a file by file restore rather than a block restore. A file by file restore requires that file metadata be created. For the ~15TB (Strongspace) we had to create, that came down to copy times of roughly 1TB every 10 hours, or 150 hours to copy the data. This time was punctuated by some trial and error to assess what the problem actually was.
I think I was clear in the post that in this particular instance backups=ZFS+Thumper.
Finally, I never made the claim that ZFS protects under all scenarios. I did claim that we felt ZFS+Thumper provided us enough protection. That proved to be true. I also said that given the inherent “bet”, we’re getting out of a storage service based solely on ZFS+Thumper and will be offering one that has much more redundancy (for uptime), but which will be, by nature, much more expensive.
— David Young 110 days ago #David,
I appreciate the problem and the explanations, and frankly, wasn’t too directly affected by this outage. But I have to admit, the promise of something in Oct 2008 to lead to a reliable StrongSpace replacement — not convincing. Joyent as a company has an abysmal record meeting the rosy deadlines that are frequently thrown about.
Without the need to rehash all of the predictions of delivery date, I’m still waiting on:
* TextPanel (yeah, I know, never coming — but I’ve been here long enough to count that as a promise)
* Telephone support for customers without L+ accelerators
* Accelerator tickets (I got mine, but still backed up, no?)
* IronPorts (or any spam solution) on Accelerator
* aliases on connector
and on and on
So with all due respect to the Joyent team, it’s harder and harder to take any promise of a future service seriously. Is there anything that’s changing that should give me hope this time?
Frankly, I’d settle for executing on existing promises before the new ones.
— John Paul Ashenfelter 110 days ago #Thanks David, I think that’s more than fair compensation, actually. I appreciate it.
— John Topley 110 days ago #@David: I never said that Joyent doesn’t know how to scale web services. What I did say:
And if you think having 10 days of downtime “worked” then you and Joyent have no clue what it means to run a web service or a data business.
I stand by that statement. The thing is, I don’t think you do believe that it “worked” / that it was a “success”. How could you? It’s why this entire post makes me so uncomfortable. It’s PR / Spin / saving face. It can’t be the truth.
You ended up losing a ton of revenue, you’re not charging for strongspace for the next 9 months and you’re giving 4 months of free Bingo service. Plus all the man power that went into fixing this, there’s no way you think this was a success. I can’t imagine that at Joyent everyone’s running around talking about what a great success this is. I just don’t buy it.
— Tom 110 days ago #I consider it a “success” that I didn’t lose my data. Downtime sucks alright, but I’m not seeing the “spin” here.
— Brian 110 days ago #Strongspace was a success, until we hit 10 days worth of downtime.
It may be non-trivial to upgrade production systems, that’s why there needs to be a level of redundancy (backups, mirrors) at the server level as well as the file system.
Write to both during normal operation, write to one while the other is down during an upgrade and subsequent testing, then re-sync the data. That would have to be more trivial than what we’ve just been through.
As it has been pointed out it is possible, and I find the 7-10 times cost a little difficult to swallow. Double the hardware, add a faster pipe, and save the cost incurred over the last two weeks as a result of the downtime – lost customers / reputation, and mad-rush efforts to recover data.
To leave a stand-alone system running old code with major known / resolved bugs (the impact of which has been described as “when”, not “if”) because it has to go down for extended periods during an upgrade, without any testing of how the updates will play out with production data, seems questionable at best.
— Tai Lee 110 days ago #I have a feeling that in the end, I will be priced out of any Joyent offerings. I will enjoy the free service for the next 9 months though. Ultimately, I hope it makes financial sense to stay.
— James Lindeman 110 days ago #You know you’re in trouble when mrmachine is the voice of reason compared to you.
Joyent’s real problem is that they were running the services on a shitty freebie thumper from sun. Buying a real one would’ve cost a fortune, much more than the 10x they’re claiming.
But hey, you get what you pay for, rsync.net is available right now.
— Wooznee Zoobarak 110 days ago #I’m sure you never saw this “Black Swan” coming. A “Black Swan”: an unplanned event that destroys all plans.
The ZFS bug you encountered made all the redundancy and process plans for your storage irrelevant. I’ve been seeing many companies move to a disk-based backup recovery strategy, and many will experience similar service outages because the volumes of data under disk-based management have grown to many terabytes and the software being used (Virtual Tape Libraries) is assumed to be incapable of losing data.
In your case, the software managing the data was effectively an operating system that needed to be re-installed. So, you ended up moving 15+ TB of data between two storage systems over ethernet to ensure the files were protected after re-installing a fresh copy of the patched OS.
I feel the pain of the entire Joyent organization. You attempted to give the best answers available without overstating while trying to engineer past the obstacles and re-host the customers’ files.
I congratulate Ben and the admin team for what must have been days of support nightmare.
Systems-as-a-service is an immature IT methodology and there will always be surprises for new innovative approaches and occasional serious outages. It comes with the risk of bringing new innovations to market.
Joyent is a risk taking company and the effort to save money, deliver a high quality service with the most innovative engineering practices is appreciated. The transparency with which you conduct your business is admirable and can serve other less innovative IT shops as either a proof-of-concept for new approaches or a case study for which technologies are still not ready to base your career on.
They call it the bleeding edge because it sometimes hits a major artery: you didn’t lose any data. You just halted the delivery of the service. The blood loss wasn’t fatal to Joyent and hopefully not to any clients.
— McD 110 days ago #I think the point though re: Black Swan is that the black swan is unexpected, but with the StrongSpace setup failure was expected due to the architecture.
In addition, failure was expected for 10 months because of the known bug in ZFS code.
Recovery from any failure was expected to be difficult, due to placing 15TB of data on a single Thumper with no other backup.
No data was lost in this case due to luck, not design forethought. It just as easily could have been.
Joyent’s attitude is all wrong. This was not a success. And it’s sad, because I think a lot of people are rooting for you guys.
Here’s an example of the right attitude towards failure:
http://www.joelonsoftware.com/items/2008/01/22.html — Bob 110 days ago #Joyent team: thanks for bringing our data back from the brink. much appreciated. it was a rough patch that you all went through. you survived and will take big lessons away from your ordeal. May the Data Archiving Gods be with you and also with our data. cheers…
— giovanni 110 days ago #I think joyent needs to spend a little more time actually building redundant systems and a little less time telling us how great and reliable your products are.
3 TB of data is trivial to back up and move when you have the right hardware. No really, it is. This isn’t 1984.
Fibre, fast drives and smart data backup regimes mean you can snap the entire data set once, then increment data changes endlessly, stitching a ‘full’ set at any time to recover.
It’s a shame as lessons don’t seem to be being learned. Redundancy and backups aren’t a luxury in a production environment, they should be a core component.
You dodged a bullet. It was pure, blind luck. Learn from it.
— Brendan Borlase 110 days ago #I think there is a lesson to be learned from your unfortunate experiences over the past couple weeks. While you’ve certainly been schooled the hard way, hopefully other companies will see this as an opportunity to assess their own backup policies related to mission critical systems and take corrective action where necessary.
I wish you guys the best of luck with your successive solutions and hope that you can recover from this unfortunate incident.
Duncan.
— Duncan McAlynn 109 days ago #I think the main problem here is that the rationale behind the SS architecture with its pros and cons were not well advertised (one had to deduce them from the FAQ mentioned by Dave – almost impossible for non-technical customers) so people started relying on this service more than they should.
Personally I’ve always considered SS as a “secure online backup” and as such I’ve been perfectly happy with it and still am.
This unfortunate event didn’t change my overall confidence in Joyent’s expertise and commitment to provide great services at all.
This post and its comments are an exceptional example of openness too, for which I thank Dave and his team despite some mistakes they might have committed (e.g. the “corruption went into the backups” message).
Personally I was a bit puzzled by the proposal to provide the new “stronger and much more expensive space in 8 months” which sounds like a rush job: crisis time can be a dangerous moment to take long term decisions… but that’s up to you!
— dolom 109 days ago #@David: That’s wonderful! Congratulations to bingodisk for this attitude.
— Rafael 109 days ago #I’ll keep my business here. Thanks
David, thanks for the update.
I think others have pointed this out before as well, but we really need a single location to determine exactly what plan we are on. Transitioning through the various spaghetti offerings (from VC2 to Mixed Grill to something) I have completely lost track of exactly what it is that I am entitled to. The customer.joyent.com web site tells me my Connector plan. The StrongSpace dashboard tells me my SS plan. What about my SharedAccelerator plans?
Such a web site would also be extremely useful when such an announcement comes out. I simply log in and it tells me how my plans have been upgraded/changed.
Thanks!
— Diwaker Gupta 109 days ago #@Diwaker: we’re working on it. Thanks for the feedback.
— David Young 109 days ago #@dolom: thanks for the feedback. The new service isn’t a rush, the announcement of it, yes. We’ve been planning a better storage service for some time, but the implementation details are a bit complicated. I think you’re going to like it.
— David Young 109 days ago #@Brendan: it took 10 hours to move 1TB. That’s the fact. This was over a 10GB network. The issue isn’t the equipment, the issue is we were being conservative to ensure we didn’t lose data. So we did a file restore, not a block restore. As I said in the post, file restores, especially for large data sets, require lots of time because the file metadata needs to be recreated.
— David Young 109 days ago #@dave: I’m sure I will and am already looking forward to beta-testing it!
BTW: I second the request for a centralized customer page detailing the plans we are on…
— dolom 109 days ago #thanks for the detailed follow-up, the compensatory offerings, and the future path that prevents this type of problem from occurring again.
much appreciated.
P.S.
— David M. Besonen 109 days ago #i also very much appreciate Joyent’s willingness to significantly revamp an offering when it becomes obvious that the offering in question has become an albatross.
So you respond to 10 days unscheduled downtime with essentially a publicity exercise about all the contributions you’ve made to open source? That’s great news for your customers.
With that and the “red headed stepchild” attitude towards old TxD customers I think I’m done here. TextDrive/Joyent’s attitude towards deadlines is at a level unthinkable for ANY company, let alone hosting companies.
— Brad Wright 108 days ago #You know? I’m not a techie and all this is missing me by a mile. I’m on strongspace for only 2 reasons:
1. I can upload large files
2. I can set up accounts for selected people I want to share those files with
Please just let me know if I can continue to do this RELIABLY. I’m not too worried about price – I’m paying only a few bucks a month, but frankly, I’d pay $50 a month readily for what I need if it was just fast and reliable in doing those things that I need.
Your comments on that please David? And if Strongspace is coming to an end, what’s in store for customers like us?
Thanks!
Chris
— Chris Tan 108 days ago #As a VC “lifer” I knew I was taking a bit of a risk by signing up for “lifetime” services, but that’s what VCs do. Although the service has had its growing pains, knowing that it is striving to improve (and take my account with it to features I never anticipated) while still honoring my “lifetime payment” is good enough for me.
When I weigh a couple of weeks of downtime, less than stellar uptime in general and some performance issues against the option of the company folding (a very real concern when you pay for lifetime anything) or the option of facing years of continued payments, I’m happy with my choice.
Glad things are back on track, and thanks for all your efforts.
— Sarah 108 days ago #I think the plan was to have the files stay on disk… many disks with multiple copies. No tapes were designed into the scheme and no off-site or remote location strategies.
Many companies are following these trends.
The bug with ZFS didn’t impact the environment until a “zpool import” failed and the failure pointed to a known failure. The fix being: update your OS instance… there was no simple OS patch to fix the environment.
Upgrading the OS on a Thumper meant the ZFS file needed to be offloaded to another Thumper and the system re-built from the OS and re-loaded to be put back into production.
Backup, recovery and disaster recovery issues have all been thoroughly reviewed publicly.
I’m rooting for Joyent to regain its edge and confidence. The well capitalized players can throw hardware and resources at the problem but Joyent is actively seeking new solutions that lower costs and deliver value.
No data was lost and the new service will benefit from the experience of this event.
I still see it as a Black Swan event. Ben and his team didn’t ignore facts related to ZFS; they just discovered the exposure through working issues as they surfaced.
Joyent is supporting “OpenSolaris” themselves and they can’t leverage support mechanisms that commercial software provides… patches, deep engineering effort.
Life, as always, is made of compromises… re-designing the service is the right move.
— McD 108 days ago #@ChrisTan: yes, you can use and rely on Strongspace. As David mentioned in the post above, we will be phasing the product out this year and are working to replace it. As a Connector customer, you will get a coupon for 4 months of free service when it launches.
@Sarah: thank you very much for your feedback. One thing I would like to note since you mentioned ‘uptime’ is you would benefit greatly by migrating over to the new Shared Accelerators on Solaris. It is a free offering to all Joyent/TextDrive customers currently on the BSD servers.
More information can be found at http://discuss.joyent.com/viewtopic.php?pid=159581#p159581
— Kristie Wells 108 days ago #Don’t want to be pedantic but when you say “The replacement service will likely be introduced before October, 2008. We will retire the current Strongspace on 1 October 2008.”... does it mean you will retire SS on Oct 1st no matter what the status of the new service is?
— dolom 107 days ago #I’m certainly glad to have my 3Gb lifetime space back – interestingly, downloading all my files somewhere else went much, much more quickly than my colleague’s experience of Strongspace transfer speeds in the past. (He’d noted a pretty solid cap at around 128 Kbyte/sec, consistently, whatever connection he used; I was getting several Mbyte/sec!)
Until David said it was aimed at people hosting static web content, I assumed Bingodisk was aimed at precisely the use I would have for it, except with the wrong protocol (webdav rather than scp/rsync) and at a price I liked. I have no use at all for extra user accounts (indeed, they don’t work with the protocol I do use Strongspace for anyway!) or for encrypted storage (I might encrypt the content I upload, but of course that’s entirely unrelated) – I just want X Gb of rsync/ssh storage. Unfortunately, that’s somewhere in the gap between Bingodisk and Strongspace – with “Strongerspace” looking like it only widens the gap further. From the Strongspace users I know, I think I’m far from alone in this!
(I’m a little puzzled why it had to be one enormous filesystem, rather than simply spreading users across multiple smaller systems as presumably the Accelerator and Connector hosting is arranged? Of course, having all the storage accessed through iSCSI would have avoided almost all this downtime: upgrade a different machine, point it at the iSCSI storage, zpool import, the end. Is the iSCSI storage that much more expensive than local disks in a Thumper??)
— James 107 days ago #@Kristie,
I’m responding to your post from the other thread since you wanted to continue the discussion here.
I posted that comment shortly before David’s post. Both the StrongSpace and BingoDisk pages had and continue to have a somewhat ominous sounding “we are no longer offering the BingoDisk (or StrongSpace) product line.” Since that was posted before David’s official response, it caused some amount of consternation. Additionally, the BingoDisk page here (http://www.joyent.com/connector/bingodisk/) does not allow signups, while this page (http://www.bingodisk.com/) does. I still stand behind the content of my previous post (and these two pages support it to some degree). Thanks for the response.
— Mark 107 days ago #