Joyent Weblog

Strongspace and BingoDisk: Update

High-level Explanation

Joyent is working to bring our Strongspace and BingoDisk products back online. They were taken offline this past Saturday (January 12) because of instabilities (e.g. read errors, checksum failures) that Joyent has experienced with Sun Fire X4500 hardware (aka “Thumper”) and that were only exposed after a series of upgrades to the operating system itself. A ZFS bug prevented a speedy recovery. No other Joyent products have been affected. Strongspace and BingoDisk are the only services running on X4500s. The rest of our X4500 inventory is used for cold backups of Joyent server nodes.

ZFS continues to be a competitive advantage for Joyent and our customers. Typical nodes at Joyent running ZFS can recover from crashes in a matter of seconds. As I said, the interruption of Strongspace and BingoDisk has not affected other Joyent services.

Technical Explanation

The nature of the Strongspace and BingoDisk products has complicated the recovery of these services. Both services run together on a single Sun Fire X4500. The X4500 is a dual-socket server/storage device with 48 500GiB drives. It has been pointed out elsewhere that we were running an older version of the OpenSolaris operating system on this X4500. That is true. However, since this particular X4500 also housed two services (rather than just backups), we had been waiting to upgrade it in anticipation of some software updates that were/are in the pipeline for OpenSolaris itself. The improvements include an improved “scrub” (which can currently stall or hang), faster ZFS lists (listing datasets can sometimes take more than an hour on machines such as an X4500 with sizeable data), the ability to recursively replicate datasets, and graceful recovery when a device (i.e. drive) fails.

Unfortunately, OpenSolaris does not currently provide a straightforward upgrade process from build to build. If all the stars aligned, an upgrade would take about six hours. Realistically, we estimated we would have needed to schedule a multiday downtime, given the historical uncertainties around importing zpools from older versions of ZFS into newer versions of ZFS. Further, the sheer amount of data managed by these services means that moving data around for recovery purposes takes a lot of time. We’re up against the laws of physics.

As predicted, the upgrade of the operating system, in response to the interruption, went relatively smoothly. We have been working for 3 days to get the data usable. We have swapped X4500 chassis completely. Work continues.

Conclusion

We continue to work to restore Strongspace and BingoDisk. We are still in the midst of the data recovery process. I will continue to update this post as I learn more. If you have questions, please post a comment and we’ll try to answer.

Update (Thursday, 17 January, 6:30am Pacific)

In layman’s terms, we have been struggling to get the data off affected Thumpers (X4500s) without the system locking up. However, there’s positive news to report this morning. From the trenches (from Ben Rockwood):

Around 6:30pm (Wednesday, 16 January, Pacific) Mark [Mayo] brought over his awesome Perl script which manages parallel rsync (which he implemented in San Diego and has used with great success elsewhere). He kicked off that process, but shortly thereafter data ceased moving and everything locked up. This was almost identical to what I experienced the night before (although I was using zfs send/recv in a loop rather than rsync). In both cases all IO ceased and everything blocked behind a single disk. The frustrating thing was that although both events were identical, they were blocked on different disks.

I recalled attacking various read error problems on Alpha in San Diego, similar to what we’re seeing on Thumper2, and re-invested myself in that research. As a result I’ve disabled NCQ on Thumper2. SO FAR the result is extremely positive! I imported the pool and have been using Mark’s uber-rsync to move data off and so far not a single bus reset or read error!! Fingers crossed!

I’m heading to bed now, but I’m really encouraged with the progress right now. Data is moving off cleanly without errors at a rate of 50MB/s!!! This is a significant improvement over the 25MB/s I was getting previously.
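For a sense of what these transfer rates mean in practice, here is a rough back-of-the-envelope sketch in Python. The 10 TB total is an illustrative assumption, not a measured figure for these systems, and the names `transfer_hours` and `ASSUMED_TOTAL_TB` are hypothetical. Decimal units are used throughout (1 TB = 1,000,000 MB).

```python
def transfer_hours(total_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to move total_tb terabytes at a sustained rate in MB/s."""
    total_mb = total_tb * 1_000_000      # decimal units: 1 TB = 1,000,000 MB
    seconds = total_mb / rate_mb_per_s   # assumes the rate never drops
    return seconds / 3600


# Hypothetical data volume, chosen only to illustrate the arithmetic.
ASSUMED_TOTAL_TB = 10

for rate in (25, 50):
    hours = transfer_hours(ASSUMED_TOTAL_TB, rate)
    print(f"{rate} MB/s: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Under that assumption, doubling the sustained rate from 25MB/s to 50MB/s roughly halves the copy time, from about 4.6 days to about 2.3 days, and any stall or lock-up adds to that.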

I will provide another update later this morning.

Many of you have asked about the compensation Joyent will be providing to those affected by this outage. It will be generous. I’d like to get service restored before I announce our compensation plan.

Update (11am, Thursday, 17 January)

A further update from Ben Rockwood:

We’ve been fighting and fighting with Thumper2. Previous attempts to copy data off went for a while and then wedged. However, this morning, I’m excited to say the transfers started last night at 3AM are still running!

And! No read errors like the piles we were getting earlier!!! With over 6 hours of transfer, not a single error!

When the data has been recovered, we can restore service. I’ll continue to update.

Update (7:45am, Friday, 18 January)

We’re expecting to be able to come back on-line by mid-afternoon today (Pacific time).

Update (9:00am, Sunday, 20 January)

As noted yesterday, BingoDisk is up. This from Ben, regarding Strongspace:

The final resilver on Thumper2 is approx 8 hours from completion. We’ll bring the services back online early afternoon. No further complications are expected from this point on.

Strongspace will be back on-line today.


  1. I have two questions that I was not able to answer after reading the above.

    1. Why have messages not been posted to the Strongspace and BingoDisk sites or sent out via email? I’m guessing that not all users of those services watch this blog.

    2. Are users of these services at risk of losing data?

    Dustin    535 days ago    #
  2. I’m with you guys… I know you’re going to fix this: forza Joyent!

    — dolom    535 days ago    #
  3. It would be nice to post an informational message or link that people will see when they go to an affected Strongspace page. This is better than putting the burden of finding out what is wrong on the customer. Like Dustin said, email notifications would be nice too. Trouble with systems is understandable but ignoring notifying those affected is not. Thanks for working on it diligently and providing these informative posts though.

    — Arthur    535 days ago    #
  4. The Joyent Server issues page in the support section of Joyent.com has up-to-date info on these issues.

    fitzage    535 days ago    #
  5. What I don’t understand is how this has affected your backups. If you have backups on your other thumpers, why haven’t you restored those or something?

    I hope the backups weren’t on the same filesystem / zpool!

    — Steve    535 days ago    #
  6. @Dustin, @Arthur: we’re addressing the issue of telling people now.

    David Young    535 days ago    #
  7. Sadly, I predicted this 13 months ago and commented about it right here on this blog … when I wrote:

    >>“how does an upgrade to the Thumper OS occur without taking everyone offline?”

    http://www.joyeur.com/2006/12/14/utility-or-generator-in-the-back-yard#c001209

    Sorry to hear about the downtime guys :(

    — Jonathan K.    535 days ago    #
  8. @Steve: same version of OpenSolaris. Same issue. Serendipity.

    David Young    535 days ago    #
  9. @Jonathan K.: well, just to correct your comment, from way back when, Joyent has never had everything on a Thumper. Ever. Just Strongspace and Bingodisk. Connector and Accelerators are on different storage infrastructures.

    David Young    535 days ago    #
  10. I am not a techie. While I can appreciate the technical issues the grunts on the ground are facing, we are paying for a service and aren’t getting it. It is affecting our company’s operations big time and I find this completely unacceptable, technical explanations notwithstanding.

    I think it’s about time your company has a good, hard look at how things are being (or not being) done. If Joyent cannot meet the service standard that people are paying for, they will soon vote with their feet and go elsewhere. The quality of service we have been getting is just way below acceptable.

    As a paying client, I think we deserve an explanation (not a technical one) from the company, sent directly by email to us, and what measures Joyent is putting in place to ensure that these things don’t happen again. So far, we’ve been left in the dark, having to come poke into this website just to find out what’s happening. Some forthcoming proactivity will go a long way.

    The service quality so far is simply not acceptable.

    Chris Tan    535 days ago    #
  11. I completely agree with Chris Tan. Joyent has had more problems of this kind than any other company I’ve ever worked with.

    The MOST IMPORTANT THING is keeping customers in the loop pro-actively when things happen.

    — Richard Linkner    535 days ago    #
  12. @Chris @Richard: we’ve been updating the status of Strongspace and Bingodisk on the status page. We’ll try to be more proactive in future. Our services are utilities. We try to ensure they never go down. But the latest issues with Strongspace and Bingodisk are, literally, a perfect storm. That’s caused damage. Again, my apologies. We are working around the clock to restore service.

    David Young    535 days ago    #
  13. Your shared hosting has been laughable for over a year now, but Strongspace was always the thing I could rely on. I subscribe to your status RSS feed, but have I heard about this issue till now? No.

    Come on guys – how hard is it to copy and paste the above into an email and email out to all the affected customer accounts.

    Oh, wait, I get it – the customer accounts database was on the now-dead thumper, right? And you don’t have a backup? Nice.

    Tell people about what’s going on – tell people about what you’re doing. Tell people what might be lost, and tell us what will be put in place as reparation.

    I’d love to say I’m surprised, but you guys have seemed more concerned about good design work than paying customers since you had your makeover last year. I’ve got loads of SSH space dotted around the web, but I use StrongSpace because it just works… Guess I’m going to have to look into setting up my own system. At least that way, I’ll know when it breaks.

    -J

    — Hostile Monkey    535 days ago    #
  14. I can’t believe I haven’t gotten an email from you about this. The point of Strongspace is that it is meant to work quietly in the background, and unless I’m obsessively checking my rsync logs failures won’t be obvious. Reading Joyent’s blog is not on my daily (or monthly) list of things to do, either.

    Funnily enough, I found out about this failure (via reddit, ffs) in the middle of a discussion with a friend of mine about the irritatingly sophomoric email DreamHost sent us both the other day regarding an overbilling screwup they just inflicted on most of their customers. He’s switching hosts from them as a result, but even though the DH email sounded like it was written by a 15 year old, it was sent out in a timely fashion, it explained what had happened and it told you the problem had been fixed. Meanwhile over here things have been silently out of whack since the 12th? Jeez.

    Kieran    535 days ago    #
  15. The bug you linked to above has been fixed for over a year, and the fix was last updated nearly a year ago. I would expect a service advertising safe online backup to stay on top of known bugs and update accordingly.

    — Mark    535 days ago    #
  16. This beats the shit out of TextDrive wrongly slaying many of their customers’ supposedly rogue Rails apps for three months (`samurai` vs. incorrect metrics) and then coming out later on the TextDrive user forums and quietly admitting it was their fault… and no, no financial compensation or apology when we wrote to ask about the situation..

    — Derek    535 days ago    #
  17. Mark: I don’t want to trivialize the severity of this issue, or try to place blame elsewhere, but it does bear repeating that the bug linked to above wasn’t the cause of the problem. It got in the way when attempting to recover the zpool, and created about 6 extra hours of work, yes. As David and Jonathan K. stated above, doing upgrades on things like thumpers is non-trivial and high-risk, and so we watch the commit logs and bug reports and decide what constitutes a critical problem that demands an upgrade, and what doesn’t. The root cause of this problem isn’t yet fully understood, but I can tell you that we’ve hit several apparently “unique” and “impossible” conditions that definitely have not been reported in any existing bug or commit log.

    I really wish it were as simple as “stay on top of known bugs and update accordingly”. Really, I do.

    Mark Mayo    535 days ago    #
  18. Add me to the list of people who wondered why we only found this out when backups started failing, with no notification in the RSS summary or an email (my monthly bill acknowledgments seem to get to me on a timely basis, so the mailing infrastructure exists).

    I don’t want to pile on here, as I have done this kind of work in a high-volume/high-availability environment, but communication is really important, especially to non-technical people. No one likes surprises and non-technical people really hate surprises they don’t understand.

    When you do your post-mortem/after-action report, make this a bullet point.

    — paul Beard    535 days ago    #
  19. The Current Issues feed has had a post continuously updated about this since the 12th. The first notes don’t tell much, but if you say you follow the feed and still didn’t know Strongspace was down, that is not Joyent’s fault.

    — Daniel    535 days ago    #
  20. David,

    Just before I went to bed last night I left a comment about what you guys are going to offer to the affected people/customers (besides your apology).

    For whatever reason, this comment was erased from the forum.

    Why?

    Bart

    — Bart    535 days ago    #
  21. How is this going to affect our fees? Are we going to get extra days? or money back, or anything like that?

    I’m really bothered by this because I signed on to get some new features up on my site, and just after I got them up, the service went down and my users are left with a blank.

    Rafael    534 days ago    #
  22. @daniel – the initial post appeared in the RSS feed, the subsequent updates did not (unlike other status updates). The feed does not seem to have been updating its “publication” date, so NetNewsWire has decided that it’s still an old feed item and has not been displaying it:

    Sat, 12 Jan 2008 20:04:06 GMT

    NNW doesn’t seem to be at fault here as this is expected behaviour.

    So, add “fix the RSS feed” to your list of things to do, guys.

    — Hostile Monkey    534 days ago    #
  23. I also don’t understand why an e-mail was not sent to users with a notification of the problem. I finally checked out this forum page and discovered why I haven’t been able to update my files. If you want to be taken seriously, you should behave professionally towards your customers. When one pays for a service (no matter how little), one expects to be notified when there are problems, and be kept up to date. This is actually causing me problems and I am having to resort to a different service until you sort out this cock-up.

    — roger b    534 days ago    #
  24. Please remove the “register for account” pages while the fixes are being processed.

    Yesterday I registered and the payment went through, but there was an application error on the registration part.

    I guess I should have read the “status page” first, or not…. ;(

    Regards
    // Jonas Montonen

    JoNtE    534 days ago    #
  25. Looking for some estimate of when I can expect service to be back-up…..In urgent need of access to some files that are located on the system.

    sta

    Sterling Ashby    534 days ago    #
  26. Do you have an estimate as to when service should be restored? The next 48 hours?

    — William    534 days ago    #
  27. Daniel,

    Let’s leave the discussion on why you did not inform us in XYZ way for now.

    When will the service be up again? I need some important docs before the weekend.

    Bart

    — bart stevens    534 days ago    #
  28. The BingoDisk signup page has been taken down; we are not selling services until the system is back up.

    @Jonas: it seems your order got through before the page was pulled which was my mistake. A credit has been issued to void out the charge.

    Strongspace is part of the Connector product which is still available. When you purchase an account, it notes the Strongspace piece will follow. Connector Collaboration Suite and Hosting have not been affected by this outage.

    The amount of data that was required to copy onto new Thumpers is simply incredible, and is taking longer than expected. We hope to have an ETA on when the systems will be back up shortly.

    Kristie Wells    534 days ago    #
  29. It’s a work of drama, reading of your trials and tribulations over rsync, uber-cool Perl and other wonders of the magic that is ZFS.
    Looking forward to the concluding chapter!

    Forza Joyent!

    — giovanni    534 days ago    #
  30. Actually I believe it’s time to sympathize with these guys who have been working night and day over four days in order to recover from the biggest mess of their life.

    It’s pretty clear they have some responsibilities for the mess and for the lack of communication that followed, and it seems to me they’re well aware of it, but we will talk about it when the mess is over and everyone will decide what to do.

    Let’s hope our data is back soon and, once again, Forza Joyent!

    PS: do you realize how long it takes to move 10 TB of data at 50 MB/s? It’s more than 55 hours, i.e. about 2.3 days, just to move the data at full speed (which was reached a few hours ago)...

    — dolom    534 days ago    #
  31. I do sympathize with the folks working on the problem and I hope that the comp. for down time is fair.

    That said, the performance of strong space is awful. I cannot even count the number of tickets I’ve posted about slow response, slow sftp, slow login, invalid subdomain logins, etc. etc…. It goes on and on.

    Every ticket gets answered 2 to 3 days later with a simple “problem is now fixed”. I guess it is hard to be patient and sympathetic when this current crisis is just a symptom of a larger technical and customer service problem.

    — carebear    534 days ago    #
  32. On comments vanishing, note that the button says Preview. The “preview” looks a lot like your comment has been posted, it’s not clear you have another step.

    I want a push notification when outages like this happen, and a push notification when the service returns. Reloading a web page day after day is stupid.

    — slumos    534 days ago    #
  33. This is best seen as an indication that not much has really changed.

    I’ve been a longtime customer, been through the Planet outages, the constant crashes at the next datacenter, the slow response times of services, the killing of user processes, etc.

    Growing pains. Bad users. Bad servers. Things happen.

    Things have gotten better, but it’s been painful. We thought we were out of the woods. Again. Wait until we move from the Planet. Wait until we’re on Solaris servers. Each iteration is supposed to solve the problems.

    But the real problem has never changed. Communication. Throughout every major outage, through every serious problem this company has had, it has done a piss poor job of keeping its customers informed about what’s going on. There are tools in place to inform us customers: mailing lists, RSS feeds, blog posts. Just like the days of old, it’s up to US to notice that something’s wrong and then do our damnedest to find out where it’s buried, then pester the admins with questions about what’s going on because none of them – NONE OF THEM – have ever done a decent job of doing so on their own.

    — toloden    534 days ago    #
  34. hey benr,

    the blow-by-blow updates are really appreciated.

    this very much meets my need for consideration.

    muchas gracias!

    — David M. Besonen    534 days ago    #
  35. @Bart: I want to assure you – no comments have been deleted from this blog post. There are two steps to publish, the first is ‘preview’, then ‘submit’. I checked the logs, there is no comment from you.

    Also, side note – the Daniel that responded was not the Daniel that works here. Just thought it important you know that.

    @Everyone: as for communication, yes, it could have been better. Absolutely. It is something we are working to correct, and we will put systems in place to ensure the immediate distribution of information so you are properly informed.

    Kristie Wells    534 days ago    #
  36. Yes, that’s right Kristie. I’m just a customer. Sorry everyone, both Joyent and customers, for any confusion I caused.

    @Hostile Monkey: I use Google Reader and somehow it has detected the changes and shown them to me along the way, even though the publication date was not updated. I stand corrected.

    — Daniel (a customer)    534 days ago    #
  37. I have to say I’m still not satisfied with the backups explanation, I don’t want to sound too hard but it looks like you’re lying to us.

    Your technical updates only mention Thumper2, which tells me that either you didn’t have backups, or they were on the same piece of physical hardware. Neither of which is acceptable.

    Please tell us the truth and stop using vague ‘got into the backups’ and ‘same bug serendipity’ sophistry.

    What was the backup structure, and how was it affected.

    — Steve    534 days ago    #
  38. @Kristie: My point isn’t just that Joyent failed at communication this time. My point was that Joyent/TextDrive has failed at communication time and time again.

    When things are going great, you excel at communicating. It’s not that hard to recall Jason pounding the company’s “transparency” since the beginning of TextDrive.

    When things go wrong, communication fails. Customers get upset and angry. Some quit, but it’s obvious that you count on the vast majority of us continuing to believe in the company, and after the fact we get promises that communication will be better “next time”.

    It never is.

    How long do you figure you can keep stringing your loyal customers along?

    — toloden    534 days ago    #
  39. @“steve” You misunderstand the infrastructure and how it works. The amount of data SS and Bingo store is not amenable to standard backups. ZFS + massive RAID is the functional equivalent of a backup. Do you think tape backups would be even remotely possible in a setup like this? Have you never heard of backups also being corrupted, even with old-school setups?

    I’m not happy about the downtime, but it is a testament to the system and the technology that even after a massive filesystem corruption bug the systems guys are apparently able to recover all the affected data.

    Calling people liars and sophists without any justification is really just taking cheap shots, especially when posting anonymously.

    Geoff Cheshire    534 days ago    #
  40. toloden, communication is much better. An email wasn’t sent out but the status page and the forums have been abuzz with info from the powers that be. It may not be where you would like, but it’s significantly better.

    steve, I don’t know much about the setup here, but data corruption can very easily be copied to a backup, especially if you back up regularly.

    fitzage    534 days ago    #
  41. Dear anonymous @Steve: we are not lying to anyone. In fact, I think we have disclosed a lot more about the issue in a technical sense than any one else might have.

    Dave has said several times we will give a summary report on what happened once the systems are back online and Mark Mayo noted in an earlier comment ‘The root cause of this problem isn’t yet fully understood, but I can tell you that we’ve hit several apparently “unique” and “impossible” conditions that definitely have not been reported in any existing bug or commit log.’

    That said, we will not disclose the intimate details of our infrastructure. We are trying to walk that line of giving our customers as much information as possible to understand the beast we are tackling, without lifting our skirt for the competitors to have themselves a little lookie. You have to understand that.

    @toloden: I get the break in communication thing. I really do. And as I mentioned, we are putting processes in place to change the communication distribution to ensure timely notifications on the good, and the bad (which we all hope is minimal, if any). In fact, my role just changed back to do just that – focus on communication.

    We take what we do, and the impact we wish to have on the community very seriously. We push the edge. We make mistakes. We are human. We dust ourselves off and push again. We are fortunate to have a solid community around us who believe in what we do, and are willing to forgive when we slip.

    Kristie Wells    534 days ago    #
  42. @Tim: I just deleted your comment. Asking that question, especially anonymously, seemed in very bad taste and served no purpose when you can simply Google and get an answer.

    ——-

    We do have a blog policy folks. We encourage healthy discussion as long as you keep it on topic and you do so with a real name and email address/website. We have allowed many more anonymous comments through on this post, as I think the comments are valid to the topic. But there is no need to be a putz.

    Kristie Wells    534 days ago    #
  43. @Kristie: Thank you for replying to my comment, though I think it’s disgraceful that it was deleted.

    @Geoff: It’s called redundancy, if there was another backup on another thumper that’d be awesome. But it seems clear to me that there wasn’t.

    Also, clearly ‘Massive RAID + ZFS’ is not the equivalent of backups, otherwise we’d have our stuff back by now

    — Steve    534 days ago    #
  44. I’m still wondering how this is going to be handled from a “we are paying for down-time” point of view.

    Kristie?

    Rafael Dohms    534 days ago    #
  45. @Steve, I deleted @Tim’s comment – are you telling me you are the same person? If so, even more reason to delete. Had you provided a real name and a link to your email or website, I probably would have left it. I have no real tolerance for anonymous posters stirring the pot.

    @Rafael: When the systems are online and Dave posts his summary, we will cover that piece.

    Kristie Wells    534 days ago    #
  46. I know that this is beating the proverbial expired equine, but, Joyent people, you do realize that as of now (17-Jan-2008 23:30 GMT) the two most recent status updates are here, an edited blog post, and not on the Current Issues page that one gets to from joyent.com/support? The only reason that I noticed these two new updates is because I went to the Current Issues page looking for updates and when I saw that the most recent update there was over 24 hours stale and something that I’d already read, I decided to look in on the discussion thread here to see if other people were bitching.

    Think about it, please.

    How would someone know to come to this specific blog post for status updates, particularly when there’s another page elsewhere that purports to be the latest and greatest?

    You may know to come here, but why would anyone else? If you can figure that out, you’re part way to understanding all of the frustration expressed by your customers over communication issues.

    — Bob Dively    534 days ago    #
  47. @Bob: We just received an update from the Systems team, the Current Issues page has been adjusted.

    The long and short of it – we expect BingoDisk to be up by the end of the week, as the data transfer from one Thumper to another is going well, albeit slow.

    We are awaiting additional word on Strongspace and will post once we have a status report.

    Kristie Wells    534 days ago    #
  48. What’s Sun’s involvement in this? Do you have escalated tickets open about the ZFS issues?

    — Jon    534 days ago    #
  49. Kristie,
    We all appreciate your responses on this forum, but I still believe it is important that Joyent/Strongspace push the information out to customers. Most posts on this forum have indicated a desire to receive an email from Strongspace, but it appears that you guys are either unable or unwilling to do the work to send out updates. Instead, you are relying on us to continue to check this site for updates.

    I do sympathize with your plight right now, but as a customer, I want information sent to me. I don’t want to go searching for it.

    Just for clarification: The problem started when you guys were upgrading the OpenSolaris version and then nothing seemed to work. Right? If that’s the case, did you also upgrade the backup system at the same time? I guess I don’t quite understand, because I would assume you just switch over to the backup system if the upgrade failed. Can someone clarify for me? Thanks.

    — Ken    534 days ago    #
  50. @Kristie

    >>>“as for communication, yes, it could have better. Absolutely. It is something we are working to correct, and will put systems in place to ensure the immediate distribution of information so you are properly informed.”

    Why don’t you just fix the communication issue NOW, by sending all Strongspace/Bingo customers an email informing them of the problem.

    Seriously Joyent. You’re the worst love/hate relationship I’ve ever experienced.

    And simply on this issue of saying you’ll “fix the communication issue in the future”, when you can still fix it NOW – is cause for me to leave and go elsewhere :(

    — Doug    534 days ago    #
  51. Hello,

    It is interesting to read in all your forums from the admins that we need to take a special care of our data, with good backup etc.

    Then you sell StrongSpace, an offer to put some of our data there in a very safe way. From what I read on the forums, I was sure that you had a full backup in another data center. But no, strongspace is just a unique system in a unique datacenter. I mean, the strongspace thumper could have burnt, all the data would have been lost.

    If some readers tell me that 10TB of data cannot be put on several backup systems, I disagree. You go from 0TB to 10TB only one bit at a time, it means that today, you can easily sync the data in several datacenters.

    I completely understand that sometimes things are going totally wrong. But I am upset, as I have the bad feeling that you did not apply to a critical part of your infrastructure what you preach in the open. It simply means that from now, I will not be able to trust you the same way as before and that is very annoying.

    My best wishes for the recovery of the data,
    loïc

    Loïc d'Anterroches    534 days ago    #
  52. A handy tip for Strongspace users. You can upload using SFTP to your Connector Strongspace account. You connect to ‘strongspace.joyent.net’ and use your connector email address (username@organisation.joyent.net) and password to login.

    Jacques Marneweck    533 days ago    #
  53. @Doug : not all of us want emails. RSS has been the standard notification mechanism for joyent/textdrive since day 1.

    @Steve : they have backups. It takes a finite amount of time to restore backups, and that’s the delay we’re seeing.

    Dick Davies    533 days ago    #
  54. @Dick Davies: RSS as “standard”? How many times have you had to hunt through forum posts, send help desk requests, look at the blog, or message an admin over IM to find out what was going on with your server?

    If your number is “zero”, you’re lucky.

    I do have to admit there was a period there, back when servers were crashing two or three times a day, that the admins (and Daniel, especially Daniel) did a damn fine job of posting statuses to the RSS feed. That period sucked for us (as I’m sure it did for the admins) but they seemed to realize that communication was important.

    It still is.

    — toloden    533 days ago    #
  55. @All concerned that my “got into the backups” explanation is a fudge: (a) I’ll flesh it all out in gory detail when we have service restored; (b) yes, it’s a bit of a figure of speech, even Sun is upset that I described the situation this way, but I think the facts will vindicate my explanation especially since, as others have pointed out in the comments, a service that manages 20TB of data is not well served by rsync or other traditional backup techniques. (c) The point of ZFS, a copy on write file system, is to, well, collapse the backup process and to protect against precisely the problems we have been coping with these past days. I’m not going to get into what all this means on this thread. But we will have a full technical explanation of what happened, the role of ZFS, the errors Joyent committed, what we plan to do in future to avoid this pitfall, and what our customers can expect by way of compensation, all of it after service is restored. Thanks.

    David Young    533 days ago    #
  56. @Dick

    So if you don’t want an e-mail, how do you propose that Joyent “improve communication to their customers”, as we keep hearing in their rhetoric?

    When a MAJOR OUTAGE of a key service occurs, you should inform your customers via the lowest common denominator. Yes, RSS is nice, but how many customers actually use it? I doubt many.

    Though if I receive an e-mail from my hosting provider, it certainly will gain my attention and I can manage my business accordingly.

    People seem to forget that some of us use SS/Bingo as a key component of the business they run. If a key component of your business fails, you want to know about it before your very own customers start yelling.

    — Doug    533 days ago    #
  57. This prolonged outage has cost my company a lot. Business has been lost as a result of this.

    I will be looking for an alternative service provider. Unless someone in Joyent can offer me a very good reason why I should stay as a paying customer, I only need to find a similar service (that doesn’t crash readily) and I’m out of here. It is deeply ironic that the service is called “Strongspace” when it has proven to be very slow and deeply unreliable. I wouldn’t even mind paying double what I am paying you now if it was stable.

    Does anyone here know a similar service? My needs are simple – I need just a few GB of space to upload files to, for download by my worldwide users with registered accounts. An online drive, if you will.

    And no, I have received no emails to update me yet. In this day and age, I really cannot understand what’s so hard about sending out a simple email to explain the situation and keep us updated. And no, it should not ever go down so bad that I feel like I’m waiting in a hospital waiting room for news of someone in emergency surgery. It should just work, period.

    Nothing personal, but my business really needs a better, more reliable service.

    Chris

    Chris Tan    533 days ago    #
  58. Basecamp just went down today and I thought they did a cool thing in regards to status updates. They added some sort of redirect on every client’s login page taking them to the latest updates. Might be something to set up for any future downtime.

    They also directed people to follow their Twitter for more updates… that works great, that is, until Twitter goes down again.

    In the end I think you have to be ready to provide your clients with multiple streams. RSS, Email, Web updates, Etc. And I agree that an email would have been great.

    — luke mysse    533 days ago    #
  59. Bad NCQ, bad. No biscuit!

    — Tyler    533 days ago    #
  60. As yet another frustrated Strongspace user, I’m astonished by Joyent’s lack of preparedness.

    In all likelihood, server technology will never be 100% reliable, 100% of the time. Why not build an alert system outside your core network?

    When mobile phones have no service, one can still dial 911 in an emergency.

    We all understand that things break down from time to time. Joyent should be focused on an end-user approach in their solution to this and other problems.

    In my view, that generally starts with better communication.

    — Jason Stoff    533 days ago    #
  61. For those not following the twitter updates, BingoDisk is back online. Strongspace is still being worked on.

    Jacques Marneweck    533 days ago    #
  62. Are the Twitter updates the same as the status page? I can’t keep up with the status page, weblog, forums, etc. I was really hoping for an email.

    Combine Strongspace being down all week with Burnaby being down all day and I’m not a happy Joyent customer today…

    Dustin    533 days ago    #
  63. @Dustin: The Current Issues page on Joyent has been, and always will be, the primary source for obtaining news on our servers/services. https://help.joyent.com/index.php?pg=forums.topics&id=1

    We have posted statuses to the blog, the forums and to Twitter in addition to the Current Issues page; they were not meant to replace one another.

    Emails were not sent – that is being corrected. I cannot dwell on the past, I can only correct for the future. That is being taken care of now.

    I am also working on setting up a delivery mechanism so you can choose how you wish to receive updates – whether that be email, RSS, Twitter, Facebook, pigeon carrier, whatever. That won’t be live tomorrow, but it is a great idea and one we will implement.

    BingoDisk is back up. We are working on Strongspace. More updates to follow from Dave.

    Kristie Wells    533 days ago    #
  64. @Chris Tan: A while back I decided between Amazon’s S3 and Strongspace for my networked storage needs. S3 is a little more work because it uses a flat file system, but its distributed storage was the winning selling point for me. Of course, who knows, tomorrow S3 may go down quicker than the Titanic, but for right now I’m feeling pretty good about my decision.

    — Bob Dively    533 days ago    #
  65. Hang in there guys!

    Considering the by-all-accounts tremendous amount of data hosted here and that none of it was lost in the face of what seems a complete meltdown is astounding. Good to know you guys are working on it.

    And, for the record, even 911 jams up occasionally. ( =

    D. Hayes    532 days ago    #
  66. Even with the huge amounts of data which apparently needed to be moved between machines, I’m quite stunned at it taking over a week to restore a single server to production.

    (Yes, I’ve had servers of my own with multi-terabyte RAID 5 or 6 arrays taking non-trivial time – 24 hours, in one case – to recover from multiple disk failures; I just hoped with all the ‘meteor’ talk that Strongspace would have better uptime, not to mention a backup system with no common points of failure with the production server!)

    I had a 3Gb lifetime Strongspace account; even before learning that it shared a single system with Bingodisk I was rather puzzled by the huge price difference for the same amount of storage and bandwidth over different protocols! I had been tempted, but didn’t have any real use for WebDAV storage and couldn’t justify the price of Strongspace even with serious uptime. The vague mention of being able to link different products together, using Bingodisk space as storage for other Joyent services, sounded good (particularly being the owner of a lifetime L Accelerator), but without enough detail to do anything yet.

    James    531 days ago    #
  67. Both Bingodisk and Strongspace service have been restored. @James: we had to restore nearly 20TB of data. I will post all the details tomorrow, 21 January.

    David Young    531 days ago    #
  68. Hmm, are you sure it’s back up? I still can’t connect via either https or SFTP. I did for a fleeting moment an hour ago, but no luck since.

    — Michael    531 days ago    #
  69. I still can’t login to Strongspace… This is driving me nuts!

    Is it back up or not?

    — Nolan    531 days ago    #
  70. So I’ve had emails saying “Strongspace is back”... “Oop, wait, no it isn’t”... “Ah, yep, it is now guys!”

    And I STILL can’t log in via web or SFTP.

    So, what is going on? I’m glad there’s more communication, but if all you’re doing is communicating BS, I’m not sure you’ve quite understood the problem…

    — Jonathan Barrett    531 days ago    #
  71. Strongspace was up yesterday evening but now at 10:19 EST it is down again. I’ve been reading all these posts and updates on various discussion boards (why are there so many, by the way?) and despite the statements that Strongspace is back up, I can’t access my account. Is it fixed or not?

    Lewis    530 days ago    #
  72. This is what I mean by various discussion areas:

    http://www.joyeur.com/2008/01/16/strongspace-and-bingodisk-update#c008552

    https://help.joyent.com/index.php?pg=forums.posts&id=701

    http://discuss.joyent.com/viewforum.php?id=2

    One would be nice.

    Lewis    530 days ago    #
  73. @Lewis: As I mentioned in an earlier comment, the main communication page for a server outage is https://help.joyent.com/index.php?pg=forums.topics&id=1. That has not changed in three years. We are also posting to the blog, Twitter, the forums and now via email to ensure that we reach as many people as possible to advise them of the status. But again, if you watch the Current Issues page, it will give you the ‘Current Issues’.

    Strongspace is currently down. Ben’s update on the Current Issues page notes Monday afternoon before it will be brought online. We will continue to update as we have information available.

    Kristie Wells    530 days ago    #
  74. Kristie, look at the posts in the Announcements forums concerning the Strongspace/Bingodisk problems. There’s not one mention in any of those posts about the existence of the Current Issues page. There is a link to this blog post. In which the Current Issues page is only mentioned in the comments.

    It’s great that Joyent has all these channels for distributing information. But if you’re only sending out some of the information over some of the channels, you’ve got to tell the recipients a) that they’re not getting a complete set of information, and b) where to go to find a complete set of information.

    In this case some of Joyent’s customers are clearly having a hard time finding the Current Issues page. Maybe that’s because they’ve forgotten or don’t know where to look or didn’t read or whatever. Whatever the case, it doesn’t matter – Joyent needs to help its customers find this information, even if it seems blindingly obvious to you where it is.

    I apologize if I’m coming off like a jerk here, but my intent is to be helpful. I have decades of experience working with large organizations that have trouble properly distributing information, leaving customers confused and irritated, and I’d very much like to see Joyent avoid these problems because I like it here and want the company to succeed.

    — Bob Dively    530 days ago    #
  75. Kristie, look at the posts in the Announcements forums concerning the Strongspace/Bingodisk problems. There’s not one mention in any of those posts about the existence of the Current Issues page. There is a link to this blog post. In which the Current Issues page is only mentioned in the comments.

    It’s great that Joyent has all these channels for distributing information. But if you’re only sending out some of the information over some of the channels, you’ve got to tell the recipients a) that they’re not getting a complete set of information, and b) where to go to find a complete set of information.

    In this case some of Joyent’s customers are clearly having a hard time finding the Current Issues page. Maybe that’s because they’ve forgotten or don’t know where to look or didn’t read or whatever. Whatever the case, it doesn’t matter – Joyent needs to help its customers find this information, even if it seems blindingly obvious to you where it is.

    I apologize if I’m coming off like a jerk here, but my intent is to be helpful. I have decades of experience working with large organizations that have trouble properly distributing information, leaving customers confused and irritated, and I’d very much like to see Joyent avoid these problems because I like it here and want the company to succeed.

    — Bob Dively    530 days ago    #
  76. I used to run a simple, single web server. I can remember a “perfect storm” which caused me to decide never, ever to do such a thing again! We were basically screwed for two weeks, and I must have spent ten hours a day during that whole period of time.

    Throughout the whole ordeal, I had to deal with MediaTemple. Let me tell you, they are far, far worse than Joyent. Actually, I’ve dealt with several companies who I think have a much worse track record. But MediaTemple, and many other companies like them, have pretty much NO means of keeping you in the loop on system status. Particularly when things go awry.

    All things considered, I won’t be taking my business elsewhere. Note that I don’t take all my business to Joyent, by any means. But I’m not giving up recommending Bingo Disk or StrongSpace, believe it or not. Nor am I going to stop recommending them for hosting, where appropriate.

    No-one handles these kinds of events to everyone’s satisfaction. And while most of the commenters on here are pissed (understandably), Joyent has done a decent job, in my opinion. I have received several emails, and RSS as well, and knew they were down almost as soon as it happened (from Joyent, not from my customers’ complaints!)...

    I guess I’m saying that some of you might go a little easier on these fellas. I can say from personal experience that when I have smaller issues, such as a Rails problem I can’t figure out, I’m guilty of being so frustrated that I submit a ticket, even though I realize it’s probably not Joyent’s fault.

    You know what? It almost never is. But almost every time, they have actually fixed it for me. It’s unbelievable. I once complained on my company blog about an issue with a Rails application running on Joyent, and David personally wrote to offer help on the issue. And they fixed it – an issue that was not their fault at all, but turned out to be in my code. I’ve never known another company like Joyent, when it comes to this kind of genuine concern for a customer. A customer, mind you, who paid a one-time fee! That is, they’ve already got my money! I’m not paying next month! And David still offers to help out.

    Maybe my experiences have been different from yours, sure. But I’ve been using these services since they each launched, and have been a customer for a good three years. And nearly all of my experiences have been quite pleasant.

    That’s all. Just wanted to add my two cents. Carry on!

    Raymond Brigleb    530 days ago    #
  77. Sorry about the double post above. I submitted it once, and when it hadn’t shown up an hour later, I submitted it again. Guess I should have been even more patient.

    — Bob Dively    530 days ago    #
  78. Some people are talking about the lack of “backups”, and that we are seeing a delay in “restoring” the backup. It seems to me that we’re confusing our terminology.

    As far as I know, there is no “backup”. Strongspace is (supposed to be) a service with built-in redundancy provided by ZFS and RAID. What customers are asking is for Joyent to remove the single point of failure that comes from running all of Strongspace from a single server / data center / RAID array, etc.

    E.g. run two systems in parallel in different data centers. When one goes down for an upgrade, the other takes over. When the first comes back online, data is re-synced. That doesn’t mean all 20 TB is re-synced, only the new data.

    Without this, Strongspace is still susceptible to the catastrophic failure and data loss that can occur with all your eggs in one basket. There are no guarantees that a particular system or RAID array will not be physically damaged beyond repair at a single site.

    Even without those dangers, simply upgrading the operating system seems fraught with peril: the entire system can come down with no recourse, because the new or updated system is brought back online without knowing the consequences of those updates for production data, which is what we’re seeing here right now.

    — Tai Lee    530 days ago    #
  79. We’re not expecting Joyent to be perfect in terms of uptime of all their services (no one is), but when something like this happens, I expect quick, timely, and detailed notices. There are companies out there who provide this: Slicehost, for example.

    The thing is, email is not the answer. A status page is not the answer. Each method of communication needs to be equally timely and informative, as a person might only have time to see one notice. A post saying that a service is down without any indication of what is wrong and what to expect is not acceptable. The post can even be wrong initially; we’d understand that. We just want to see and understand the problem, with transparency. It happens constantly here. There are 5 notices on the status page right now saying a service is down, and then it’s back up without any real indication of why. We might get that Apache crashed, or MySQL maintenance is being done. We don’t get why, or what is being done to prevent it in the future.

    These uninformative (and often untimely) updates only lead to more anxiety for us, the customer, because we either have to believe that Joyent is doing their best blindly, or see a pattern of downtime which leads us to believe that they are not.

    Seeing that both Bingodisk and Strongspace are not accepting new customers is another such issue. Rather than take care of their customers first, Joyent posts some odd EOL-looking notices. I understand if they don’t want new customers right now given recent issues, but, if that is the case, they should have continued to prevent signups since the Bingodisk and SS outage was fixed. Once again, we’re in the dark.

    — Mark    529 days ago    #
  80. What can I say? After 24 good hours, it’s down again…
    Looks like we’re dealing with a company marked only by sheer incompetence.
    And of course, there’s no word whatsoever from them again.

    My business once again halted by technical incompetence.

    Oh.. the frustration.

    Chris Tan    527 days ago    #
  81. @ChrisTan: SS was down between midnight and 2am PST this morning for a SCHEDULED maintenance.

    It was posted on the Current Issues page AND in the forums beforehand. It was also mentioned in one of the emails sent out earlier this week.

    http://discuss.joyent.com/viewtopic.php?id=21137

    https://help.joyent.com/index.php?pg=forums.posts&id=709

    It seems we are in a no win situation with you here.

    Kristie Wells    527 days ago    #
  82. @Mark: not sure I understand your comment about us no longer accepting new signups. We made that decision so that we could focus on our existing customers and the product before taking any new business in.

    You say you are in the dark – did you not see http://www.joyeur.com/2008/01/22/bingodisk-and-strongspace-what-happened?
    It laid out what happened and is happening pretty clearly.

    Kristie Wells    527 days ago    #
  83. Closing comments on this post – if you would like to add something on this topic, please contribute to http://www.joyeur.com/2008/01/22/bingodisk-and-strongspace-what-happened

    Kristie Wells    527 days ago    #
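
Comments 51 and 78 above both argue that a second site only needs the data that has changed, not the whole data set, each time it catches up. As a rough illustration of that idea, and not a description of how Strongspace or ZFS replication actually works, here is a minimal Python sketch. The function names (`manifest`, `plan_resync`) and the in-memory "sites" are hypothetical; a real service would work from disks or snapshots.

```python
import hashlib


def manifest(files):
    """Map each path to a SHA-256 digest of its content."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def plan_resync(primary, replica):
    """Compare two manifests; return (paths to copy, paths to delete) for the replica."""
    to_copy = sorted(p for p, digest in primary.items() if replica.get(p) != digest)
    to_delete = sorted(p for p in replica if p not in primary)
    return to_copy, to_delete


# Hypothetical in-memory stand-ins for the two sites (assumption, for illustration only).
primary_files = {"a.txt": b"v2", "b.txt": b"same", "c.txt": b"new"}
replica_files = {"a.txt": b"v1", "b.txt": b"same", "old.txt": b"gone"}

to_copy, to_delete = plan_resync(manifest(primary_files), manifest(replica_files))
print("copy:", to_copy)
print("delete:", to_delete)
```

Only the changed file, the new file, and the stale file are touched; the unchanged `b.txt` is never transferred, which is the point the commenters are making.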

Commenting is closed for this article.
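
For readers weighing comment 66 (surprise at a week-long restore) against comment 67 (nearly 20TB had to be restored), a back-of-the-envelope calculation may help. The sketch below only converts a data size and a sustained transfer rate into hours; the rates are assumptions chosen for illustration, not figures reported by Joyent, and real restores also include verification, retries and other overhead that this ignores.

```python
def transfer_hours(size_tb, throughput_mb_per_s):
    """Hours needed to move size_tb terabytes at a sustained rate (decimal units)."""
    size_bytes = size_tb * 10**12
    bytes_per_second = throughput_mb_per_s * 10**6
    return size_bytes / bytes_per_second / 3600


# Hypothetical sustained rates in MB/s (assumptions, not measured values).
for rate in (50, 100, 200):
    print(f"{rate} MB/s: {transfer_hours(20, rate):.1f} hours")
```

Even under these optimistic, uninterrupted rates, a single full copy of 20TB takes one to several days, so any step that has to be repeated quickly adds up.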