Friday, August 29, 2014

iMacros for Firefox Failure Code 0x80500001 (Error code: -1001)

Solution

The root issue is the encoding of the files. I've had this problem before with iMacros, but it's never been quite this specific. Usually saving the datasource (the CSV file) as UTF-8 works, but some update to Firefox or iMacros has made it really inflexible. Both the datasource (.csv) AND the macro (.iim) file must be saved as "UTF-8 with BOM". I used Sublime Text to do this, but any full-featured text editor should work.
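If you want to double-check the encoding outside of your editor, a UTF-8 BOM is just the three bytes EF BB BF at the very start of the file. A quick sketch, assuming you have a Unix-style shell handy (Cygwin/Git Bash on Windows) and using hypothetical file names:

head -c 3 datasource.csv | od -An -tx1    # prints "ef bb bf" if the BOM is present
printf '\xef\xbb\xbf' | cat - datasource.csv > datasource-bom.csv    # prepend a BOM if it's missing

In Sublime Text itself the equivalent should be File > Save with Encoding > UTF-8 with BOM.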

For the record, the versions of the things I'm using:
Firefox : 29.0.1
iMacros : 8.8.2
Windows : 8.1 x64

Problem/Full Story

I often have to fill out web forms over and over again to perform certain tasks. A lot of these web forms are poorly designed at best, and don't support batch-type inputs. So having a program like iMacros is essential for me not wanting to kill myself while filling out a DHCP registration form 200 times.

I've used iMacros for a number of years and never had too many problems -- in Firefox, at least; in Chrome, the sandboxing makes reading/writing files an exercise in keyboard-snapping frustration, but that's another story. However, I needed to do a bunch of the previously mentioned DHCP registrations today (this is a system managed by another department, and the web interface is the only way to do it besides submitting a work request, which can take days), and found that the macro/CSV pair I had previously used for this was not working. I received the following error message:

Error: Component returned failure code: 0x80500001 [nsIConverterInputStream.init], line 4 (Error code: -1001)
I'd actually run into this error before, or at least one similar to it. iMacros (or possibly Firefox) can be rather picky about the encoding it uses. Previously, saving the .csv I use for inputs as UTF-8 had solved the problem. Today that didn't work, though.

After fiddling around with it for a bit, I found something strange. I created a new macro (.iim) file to see if the other one was corrupt or something, but writing/saving it through Sublime Text (not the built-in iMacros editor) as UTF-8, then opening it in the iMacros editor, just showed a blank file. Strange. After trying a handful of different encodings for the macro file, I found one that it would recognize: "UTF-8 with BOM". After saving the file with this encoding through Sublime, it would show up correctly in iMacros. However, I was still getting the same error when I tried to run it. I then saved the CSV file with the same "UTF-8 with BOM" encoding, and it ran.

Thursday, August 28, 2014

Citrix Receiver for Mac "Cannot start the desktop ... OSStatus -1712"

Solution

In my case there were non-responsive processes on the Mac client that were causing the problem. To resolve it, I closed out of Receiver and closed any active desktop connection. I then brought up Activity Monitor (command+space to bring up search, enter "activity monitor"). There were several Citrix processes: one non-responsive process with the name of the personal desktop that wouldn't load, and a few helper processes. I force-quit all Citrix processes, then restarted the Receiver client. It connected to the desktop successfully at that point.
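If Activity Monitor itself is being uncooperative, the same thing can be done from Terminal. A rough equivalent (process names vary between Receiver versions, so this just pattern-matches anything with "citrix" in its command line):

pkill -9 -i -f citrix    # force-quit every process matching "citrix", case-insensitively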

It may not have been necessary to force-quit all Citrix processes, but it doesn't seem to have had any consequences; they started back up when I reloaded Receiver.

Problem / Full Story

Had a user this morning who couldn't connect to their Windows desktop over XenDesktop (7.1). The user is one of our few Mac (running Mavericks) users, and uses XenDesktop to get to the Windows applications he needs. When he tried to connect to his Windows 7 machine this morning, he got the following error:

Cannot start the desktop "Personal Desktop"
 Contact your help desk with this information: The application "Personal Desktop" could not be launched because a miscellaneous error occurred. (OSStatus -1712).
The odd thing was, he was able to connect to his Windows 8 desktop just fine. So the connection to the server was working, as was the connection to at least one VM. The Win7 machine was showing up as registered and ready in Citrix Studio on the XenDesktop Controller, and it appeared to be responsive when interacting with it through XenCenter. I tried restarting the Windows 7 machine, but the error persisted. A brief look through the logs on the Win7 machine and the XDC didn't show any errors, so it seemed like the problem wasn't server-side. I had the user log out of/close Receiver on his machine and reopen it, but the error continued to occur.

Up in the user's office I brought up Activity Monitor and saw the unresponsive process -- see "Solution" above. After killing and restarting all Citrix processes the user was back up and running. Rebooting the Mac probably would've had a similar effect.
 

Tuesday, August 12, 2014

A security package specific error occurred - Security-Kerberos EventID 4

Solution

The root problem was that there were static DNS entries set for some computers whose IP addresses had changed. Deleting the static entries and waiting for the changes to propagate out solved the problem.

Full Story

Had an issue this morning where some new computers on our network were not getting printers mapped. This is not an uncommon occurrence, because printers, but the cause of the problem was a new one for me. These computers had just been upgraded (new hardware, same hostnames) and seemed to be functioning fine on the domain. The print driver was working fine on other machines, and the usual fix, restarting the print spooler, had no effect.

Trying to access the Event Viewer on the lab machines I got the error "A Security Package Specific Error Occurred". This error (or a variation) came up trying to access the computer via any WMI / RPC / DCOM method.

On the print server I had the following error, listed as Level: Error, Source: Security-Kerberos, Event ID: 4

The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server MYLAB-04$. The target name used was cifs/MYLAB-02.My.Domain.Com. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (MY.Domain.Com) is different from the client domain (My.Domain.Com), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

One thing jumped out here right away: the error is from lab computer 04 (SPN: MYLAB-04$), but the FQDN is listed as computer 02 (cifs/MYLAB-02.My.Domain.Com). That set off some alarm bells, but I still did some additional research before jumping in.

Supposedly this error can be caused by a number of things (a Google of "A Security Package Specific Error Occurred" returns about 6 different causes on the first page of results). In my case, as mentioned above, it was a DNS issue. While upgrading these lab machines, the IP addresses we assigned through DHCP changed slightly. Normally we just let the machines register themselves with the DNS server after they pick up their IP via DHCP; we don't have many static DNS entries. For some reason these machines had static entries, though, so our DNS server was resolving their hostnames differently than AD was, which is what caused the authentication errors. Deleting the static entries and waiting (DNS changes can take a while to replicate) solved the problem.
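For anyone chasing the same thing, the mismatch is easy to spot from a command prompt. A rough sketch using the sanitized names from the error above (the DNS server name, zone, and record names here are placeholders; adjust for your environment):

nslookup MYLAB-02.My.Domain.Com
# compare the answer against what the machine itself reports (ipconfig on MYLAB-02); if they differ, the A record is stale
dnscmd MyDnsServer /RecordDelete My.Domain.Com MYLAB-02 A /f    # delete the static A record (or use the DNS Manager console)

After that the machine should re-register itself dynamically, though as noted it can take a while for the change to propagate.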

Thursday, July 24, 2014

Samsung Galaxy Light - WiFi calling "call ended", WiFi messaging fails

Resolution

In contrast to most of my posts, the solution to this one is actually available pretty readily on the internet. I'm posting this to confirm that it worked on my two phones, because people are pretty bad at updating forum posts once they've found a solution.

The fix in this case is to run a firmware update via the "Samsung Kies" software. This firmware update is not available OTA. Testing over a ~24 hour period I have been unable to reproduce the problem, which was quite easily reproduced before the update.

While researching this problem, I found many similar reports of other Galaxy phones (s3,s4,etc) having the same problem. None of those reports seemed to have a resolution, only the Galaxy Light thread had anyone confirm that they had fixed it. So, if you have another Galaxy phone with the same problem, I would check to see if there is a firmware update available via the Kies software.

Installing and running Kies software

  1. Download the Kies software for your platform from Samsung's website
    1. There are two versions, "Kies" and "Kies 3". For the Galaxy Light you want "Kies"; newer phones may need "Kies 3".
    2. There is also a version for Mac, only one version though, not sure if that works for all phones or what.
  2. Install the Kies software.
    1. There isn't much to do here. Just click next a whole bunch, really. Though I did opt to install the "universal driver tool", not sure if that's necessary. 
  3. Open the Kies software, you should see a "connect your device" type prompt. Connect your device.
  4. If this is the first time you've plugged in your phone to the computer it may take a few minutes to install drivers.
    1. Note: I did have one of my phones lock up while connecting it to the computer (screen and hardware buttons became unresponsive). A force reboot fixed it.
    2. The device needs to be in MTP mode, not PTP mode. Kies will warn you if it's not. 
  5. When I hooked up my phones, it immediately prompted me to do the firmware upgrade.  If this doesn't happen for you, the "Basic Information" tab should show the current firmware and whether or not it's up to date.
  6. Follow Kies instructions to upgrade the firmware. 
    1. The first time I ran this, I let the phone go to sleep while Kies downloaded the firmware update. Since the download took a while (I don't have super fast internet at home) the phone disconnected from Kies and I had to start the process all over again. Had to sit there swiping the screen back and forth while the download happened to keep it from sleeping/disconnecting.
  7. Phone will reboot and install, don't turn it off or do anything to it while this happens.
    1. The upgrade performed just fine for me on both phones, with no loss of data. Still, if you have critical stuff on your phone (why do you have critical stuff on your phone? keep that stuff somewhere less steal-able), maybe back up the phone (can be done with Kies (Backup/Restore tab), or about 1000 other things) before doing the upgrade, just in case.
That's about it. Once the firmware update is complete you shouldn't have any more problems with WiFi Calls not dialing, not receiving calls on WiFi, or not being able to send/receive text messages on WiFi.

In the interest of full documentation: one call I made right after the upgrade had really poor call quality (sounded like I was underwater). It was just the one call, and the problem has not recurred since.

The Problem / Full Story

We recently switched to T-Mobile because they're reasonably cheap (for what you get) and their business model is slightly less troublesome than most big carriers'. <Rant>It took about 8 hours on the phone over 2 days to get the plan set up properly because the original guy who sold us the plan didn't know what he was doing.</Rant> Anyway, we brought our own devices, because you can pick up the Galaxy Light on Amazon for dirt cheap. It's not a high-end phone, but it's a reasonable spec with a fairly recent version of Android (4.2.2).

But enough advertising; you wouldn't be here if you weren't having a problem. The phones worked fine for a few days, but we started having problems with the WiFi calling within about a week. WiFi calling was a big deal for us because all carriers (besides Verizon, but fuck them) have not-so-great coverage in my town, but we have WiFi pretty much everywhere we go. So problems with WiFi calling were problems with our service in general.

The problems were as follows. If the phone was allowed to sleep for a while (seemed to be 30-45 minutes on average), WiFi calling would stop working. This means that without a cell signal (which I don't get at work, because my building is made out of concrete and fluorescent lighting) no calls could be made or received and no texts could be sent or received. Not that the phone was aware of this: the WiFi calling icon in the notification bar was still blue and it said it was making calls over WiFi. When you actually tried to place a call, however, it would immediately end the call (at 0:00 seconds) and the call status would be "Call Ended". Looking at call history you would see "Canceled". Trying to send texts would result in a "Failed to Send" message. People trying to call us would get sent straight to voicemail, or occasionally one ring then voicemail. The only way to restore service was to turn WiFi calling off and on (usually) or reboot the phone. This fix would only last until the phone went into some low-power sleep state (seemed to be ~30 min of screen off).

After much internet searching and playing around with settings on the phone I came across this thread (also linked at top) about the WiFi calling on the Galaxy Light. A non-OTA firmware upgrade was available and fixed the problem (see above for steps).

Here are some things I tried that did not work:
  1. Clearing data (via application manager) from "WfcService" and "Wi-Fi Calling Settings"
  2. Changing Wi-Fi calling preferences (prefer cell, etc.)
  3. Different Wi-Fi networks
  4. Standing close to router (router is on my desk at home, so while sitting at desk phone is <5 feet from router)
  5. Turning off voice control
  6. Turning off other wireless radios (bluetooth, gps, NFC, etc.) 
Here are some other things I've seen reports of people trying that have not worked:
  1. New phone - This appears to be a problem with at least all Galaxy phones. Seen reports of people getting phones replaced 4 or more times without resolution. 
  2. Factory Reset
  3. Changing network mode (LTE/WCDMA/GSM, WCDMA/GSM, WCDMA only, GSM only)
  4. Changing SIM cards/upgrading SIM cards
  5. Opening ports on the router


Monday, July 21, 2014

Condusiv V-Locity - setup and first thoughts

Introduction

I'm going to be pretty brief here because I feel I'm not going to have much to actually say about this piece of technology (note from the future: I wasn't able to get this to work in my environment; read on if you're interested in the problems I ran into, but otherwise this article probably isn't worth your time). We'll leave it at this: managing storage IO in a virtualized environment is a pain, so I've taken to investigating some technologies that look at improving storage performance without simply buying more storage devices. This post is written in stream-of-consciousness style as I go through the setup process. I try to document anything I notice and/or am thinking during the install. I do some minimal editing afterwards, but for the most part it'll be a rambling mess.

V-locity is a program from Condusiv, a name that was obviously dreamed up by someone with no respect for spoken language, ease of typing, or autocorrect. From here on I'll probably just refer to it as "the program." The idea behind the program is that the Windows file system driver is poorly optimized for an age of virtualization and non-local storage. Breaking file reads/writes into multiple chunks isn't noticeable on local storage, but can add serious overhead when it has to go over iSCSI. So through a new driver and a bunch of caching, the program hopes to optimize storage to give you better density without buying more hardware (or increasing CapEx, as they say) </marketing>. I won't go into much of the detail of how it works here (I'm still a little fuzzy after a webinar and like 6 sales calls, and let me tell you it's not for lack of paying attention); if you're interested you can read all about it here.

First let me say, Condusiv certainly isn't trying to save you any money over buying more hardware. We were quoted a price of around sixty thousand dollars (plus about seven thousand in yearly licensing) to run on our three servers (32 cores each, which is how it's sold). That's insane; it's roughly three times as much as the servers themselves cost. This more or less makes it only an option if you're out of rack space, or for whatever reason can't move your data to faster storage devices.

Setup

I've got a test environment set up: 50 VMs and a server. The VMs are running on some Dell R610 servers connecting to their storage over a 6Gb/s direct-attach SAS link. The servers are running XenServer 6.2.0 (SP1 + all other patches). The VMs are 64-bit Windows 7, all updates installed, with basic office applications for testing. Tests will use XenDesktop to measure login performance (connecting via thin clients) and a more manual approach to measure some application launches (Visual Studio 2012 is one in particular we've had take a really long time to load on VMs, due to excessive file system access during first run).

Setup is broken into three parts: the VMC (controller), the master node (V-locity server), and the clients. Since this is a test setup, my VMC and master node are living on the same server. VMC setup is simple; just click next and it installs itself and a web server to interface from. One thing: it doesn't tell you that you access it via the web page. The installer just finishes and you have to figure it out. The setup instructions don't really say this directly either; you just have to kind of guess at it (I figured it out because the install directory had a bunch of .js, .html, etc. files).

After that, the setup runs a discovery on your domain to find machines to install on. I didn't set any sort of filters on this, but it is currently stuck (about 20 minutes) on 740/742; we'll see if it ever finishes.

... 30 minute mark now, still spinning on 740/742.
... well over an hour now. Neither the "close" or "next" buttons do anything.
... two hours and no sign of movement. I'm about to go home, so I'll let it run overnight and reboot the dumb thing if it hasn't sorted itself out by morning.
...
...
...Still at the same spot; I think it's safe to say it's stuck, so I'm going to try restarting the service. Now it says discovery complete, 1 record processed. Sounds legit. Looking through the machine list, it seems to have detected a fair number of my machines, but none of the VMs I created specifically for testing.

After another restart of the VMC service and some time it picked up all my servers, but I've run into a bigger issue. The master node component won't install on my virtual server. The server meets all of the requirements listed in the various install guides and readme files, but it doesn't show up on the list of machines available for deployment. Trying to run the installer manually gives the error "OS not supported".

Looking further, it is only presenting the option to install the master node component to physical machines. I can't find this listed as a requirement anywhere, and the sales rep/tech people say that it isn't a requirement, but that's the only option it's giving me. 

Worked briefly with the sales rep/tech support team that's been helping me; they gave me some new licenses to try, but for whatever reason the program still only gives me the option to install to physical servers. I don't have spare physical servers lying around, so we're a bit dead in the water.

On a hunch I looked up V-locity + XenServer (my hypervisor of choice), and found some conflicting reports of support for the XenServer platform (PDF). At best it has partial support, and possibly only for the guest/client. So maybe that's the issue. Looking back through my emails I definitely mentioned that's what I was running on (and I'm pretty sure we covered that more in-depth during one of the 7-8 phone calls they made me sit through), but maybe I wasn't clear enough.

So, unfortunately, this is where my review of V-locity must end. I'd spend more time with their tech support troubleshooting it, but I have other projects that need my attention. So take my experiences for what they're worth (probably not much), but if you're looking to evaluate this and are using XenServer, maybe be sure you're clear with your reps about the setup.

Edit: The sales rep assures me that V-locity works "with Citrix" (I haven't gotten him to say "with XenServer"), and in the interest of objectivity, I was able to get it to start seeing virtual machines. It still doesn't see the test server I built as a valid install location, so I'm still stuck, but there you go.

Thursday, July 10, 2014

Excel Crash: Visual Studio (10) Tools for Office Add-in -- vs10exceladaptor

Solution

The solution thus far has just been to disable the add-in for all users. We don't know of anyone actively using this add-in, so that works for us. If you need the add-in, I would look toward compatibility. 0xC0000005 typically indicates that a program tried to access memory it's not allowed to. This could mean another plug-in isn't playing nice, or you might try disabling DEP (though this is a pain for Office, and more than a bit of a security risk).

To disable the add-in for all users, I found the best way was to log in as admin, find the Excel executable (excel.exe) > right click > run as admin. Then go to File > Options > Add-ins > COM Add-ins > Go. Then uncheck the box(es) for "Visual Studio Tools for Office Design-Time Adaptor for Excel".

Story

Had some users complaining about Excel crashing on our terminal server. This is a terminal (RDS) server that students use to remotely access lab applications via thin clients, so it has just about every program under the sun installed on it. I mention this only because it means we have about 1000 different Excel add-ins loading/available, which is what I expect is causing the underlying issue. Also worth noting, the thin clients connect via XenDesktop (7.1); this could also be a cause of the error.



Other notes on server: Server 2008R2 (fully updated, x64), Office 2013 x86

Looking at the event logs, I see the Excel crash (Error, Application Error, Event ID: 1000)

Faulting application name: EXCEL.EXE, version: 15.0.4535.1507, time stamp: 0x52282875
Faulting module name: EXCEL.EXE, version: 15.0.4535.1507, time stamp: 0x52282875
Exception code: 0xc0000005
Fault offset: 0x0005a802
Faulting process id: 0x2380
Faulting application start time: 0x01cf9c61a803a93c
Faulting application path: C:\Program Files (x86)\Microsoft Office\Office15\EXCEL.EXE
Faulting module path: C:\Program Files (x86)\Microsoft Office\Office15\EXCEL.EXE
Report Id: ed04ac54-0854-11e4-9867-d4bed9f3434f
Which doesn't give us much. In past experience, 0xC0000005 is a generic "memory access violation" error -- a program tried to access memory it didn't have permission to. The next entry in the event log is a bit more useful (Error, Microsoft Office 15, Event ID 2001):

Microsoft Excel: Rejected Safe Mode action : Excel is running into problems with the 'vs10exceladaptor' add-in. If this keeps happening, disable this add-in and check for available updates. Do you want to disable it now?.
This appears to be something that gets installed with Visual Studio; I have no idea what it does. I went ahead and disabled it for all users (see notes in Solution) since I'm not aware of anyone using that add-in. Worth noting that I initially tried disabling the add-in through the registry (HKLM\Software\Microsoft\Office\Excel\Addins\VS10ExcelAdaptor -- set LoadBehavior to 0), but that didn't seem to have any effect.
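For reference, the registry attempt looked roughly like this, run from an elevated prompt. Note that with 32-bit Office on 64-bit Windows the key may actually live under the Wow6432Node branch, which could be why the change had no effect for me; both paths are shown as an assumption, not something I've verified:

reg add "HKLM\Software\Microsoft\Office\Excel\Addins\VS10ExcelAdaptor" /v LoadBehavior /t REG_DWORD /d 0 /f
reg add "HKLM\Software\Wow6432Node\Microsoft\Office\Excel\Addins\VS10ExcelAdaptor" /v LoadBehavior /t REG_DWORD /d 0 /f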


Thursday, June 19, 2014

Creating A Ceph Storage Cluster using old desktop computers : Part 2

So, in the last part I left off where I had an active+clean cluster with two OSDs (storage locations). No data had yet been created, and, indeed, no methods of making the locations available to store data had been set up.

Following along with the quick start guide, the next thing to do is to expand the cluster by adding more OSDs and monitoring daemons (I'm about half-way down, under "Expanding your cluster"). So away we go.

Expanding Cluster

Adding OSDs

Adding a third OSD went just fine using:

ceph-deploy osd prepare Node3:/ceph
ceph-deploy osd activate --fs-type btrfs Node3:/ceph
#For those of you just joining us, I'm using btrfs because I can. The recommendation is typically xfs or ext4, since btrfs is experimental.

After running those commands, "ceph -s" shows the cluster now has "3 up, 3 in" and is "active+clean". Available storage space has also increased significantly.

Adding a Metadata Server

The next step is to add a metadata server, which is used by CephFS. CephFS is one option for presenting the Ceph cluster as a storage device to clients. There's not much to be said here; I ran the command and it completed.

ceph-deploy mds create Node3
# I chose Node3 arbitrarily 


Adding More Monitors

So now we set up more monitors, so that if one monitor goes down the entire cluster doesn't die. In the previous part, I ran into an issue where the monitor service started creating very very very very verbose logs, to the extent that it filled up my OS partition (several MB of logs a second). I was able to fix this with a change to the ceph.conf file, so I'm hoping that change gets carried between monitors, but I guess we'll see.

ceph-deploy mon create Ceph-Admin Node2

This didn't go as well. It installs the monitor on each node, but the monitor process does not start and does not join the cluster. Some errors during install:

  • No data was received after 7 seconds, disconnecting
  • admin_socket: exception getting command desciptions: [Errno 2] No such file or directory
  • failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i Node2 --pid-file  /var/run/ceph/mon.Node2.pid -c /etc/ceph/ceph.conf --cluster ceph '
  • Node2 is not defined in 'mon initial members'
  • monitor Node2 does not exist in monmap
  • neither public_addr nor public_network keys are defined for monitors
  • monitors may not be able to form quorum
I found a very helpful blog post detailing resolutions to many of these errors.

First problem, my admin/deploy box had a bunch of hung create-keys processes. So I killed all those.

I rebooted the new monitor node, and the mon service started, but I can't seem to interact with the cluster at all now. That's probably not a good sign. All ceph commands time out, even when run on Node1.

....

After much troubleshooting that went nowhere, I'm rebuilding the cluster: uninstalling everything and purging all data. The reinstall is going pretty quickly now that I know how everything works (ish). One thing I did find: I'm a bit clearer now on the difference between

ceph-deploy new
ceph-deploy mon create-initial 

Now. "new" actually creates the 'new' cluster. You have to specify monitor nodes though so I thought 'new' referred to new monitors. Anyway, I following all the previous steps again, trying not to take any shortcuts or anything so hopefully I'll end up right back at the point before I screwed everything up.
 
...

New problem: "ceph-deploy osd activate" fails, saying that it couldn't find a matching fsid. A bit of Googling suggests that there is data left over from the first install (despite doing the purge+purgedata while I was remaking the cluster), so I'm reformatting the drive to see if that works.

Yep, deleting and recreating the partition (reformatting) worked. So purgedata does not apparently actually purge data, at least not on my systems. Note: deleting and recreating the partition meant editing /etc/fstab again to make sure the mount worked correctly (the UUID of the file system changed).

...

Back at the pre-expanded pool at "active+clean", and another discovery made. The quick start guide tells you to add "osd pool default size = 2" to the ceph.conf file under "[default]". This is a lie; it goes under "[global]". That is why I had to go back and set the size on each pool last time in order to get to the "active+clean" state.
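For anyone else following along, the relevant bit of my ceph.conf now looks something like this (trimmed way down; your fsid, monitor names, and addresses will obviously differ):

[global]
# replication count for new pools; the quick start guide says [default], but it belongs here
osd pool default size = 2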

...

and the add-monitors step gave the same errors:

  • Starting Ceph mon.Node2 on Node2
  • failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i Node2 --pidfile /var/run/ceph/mon.Node2.pid -c /etc/ceph/ceph.conf --cluster ceph '
  • Starting ceph-create-keys on Node2
  • No data was received after 7 seconds, disconnecting
The monitors do not start on Nodes 2 or 3. I'm not going to try rebooting them this time, in the hope that it doesn't totally destroy my install again.

....

Broke it again, this time trying to use "ceph mon add <host> <ip:port>" to manually add the monitor so that it would stop saying it wasn't in the monmap. This apparently is not the way to do that.

Guess I have to reinstall everything again... joy

....

Broke it a few more times, but everything is working now with 3 monitors. For whatever reason, using ceph-deploy to add the second/third monitor was not working at all, so I used this guide to manually add the monitors, which worked, except that steps 6 and 7 are backwards: the monitor needs to be started before you run the "ceph mon add" command. "ceph mon add <etc>" will hang if the monitor you tell it to add does not respond, and if you kill (ctrl+c) the "ceph mon add" command, that's when the whole cluster becomes unresponsive. You can technically run "ceph mon add" and then start the monitor on the node, but since "ceph mon add" takes over your shell, getting to the node to start it can be problematic.
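In other words, the order that worked for me was roughly this (the address is a placeholder, and the keyring/monmap prep steps from the manual-add guide still come first):

  • # on the new monitor node: start the monitor daemon first
  • sudo ceph-mon -i Node2 --public-addr 192.168.1.12:6789
  • # then, from a node that can already talk to the cluster:
  • ceph mon add Node2 192.168.1.12:6789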

So now I've got a cluster with 3 OSDs and 3 monitors. I've got a warning about the clocks being different between the monitors, but other than that it's working and happy. I manually set all the clocks, but the clock skew warning is still happening. Restarted one node and one warning went away. I'm trying to restart another node, but creating new monitors the way I did means they didn't get put in /etc/init.d, so I can't restart them via the "service" command. Trying to find out how to add them to it.

Giving up on that for now, may come back to it later.

Finally Using Ceph


Ok, while it's not exactly in prime condition, I want to get down to the functionality, so the clock skew (it's a couple of seconds) and the whole daemons-not-being-in-init.d problem I'm leaving for later.

Going to use the laptop I set up as a proxy as the client, which means I need to update its kernel.

...

Laptop Kernel updated, now using this guide to get the client and block device setup.

Set up ssh keys, the hosts file, a ceph user + sudo access, and the ceph repo.
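For completeness, those steps on the laptop looked roughly like the following (the "ceph" user name and "ceph-client" hostname are just what I used; adjust to taste):

  • sudo useradd -d /home/ceph -m ceph && sudo passwd ceph
  • echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
  • sudo chmod 0440 /etc/sudoers.d/ceph
  • ssh-copy-id ceph@ceph-client   # run from the admin box, so ceph-deploy can log in without a password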

More problems installing ceph

"ceph-deploy install ceph-client" -- for some reason, installing this on the laptop has been much more difficult than on my other machines. Maybe because the laptop wasn't installed with a minimal install? Here are a few things I've run into.

Repo errors - I set up the ceph repo according to the pre-flight check guide, but kept getting 404 errors during the install. Looking at the ceph.repo file, ceph-deploy apparently adds additional repos on top of the ones set up via the guide; removing these and running ceph-deploy with the --no-adjust-repos flag fixed that. I don't know why ceph-deploy was adding bad repo URLs.
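In case it saves anyone some fiddling, the install command that finally worked for me was along these lines (ceph-client being the laptop's hostname):

ceph-deploy install --no-adjust-repos ceph-client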

Python error after install - After Ceph installs, it tries to run "ceph --version" to verify the install. But this failed with a Python traceback saying it couldn't find the module argparse. I ended up having to install argparse and setuptools manually. It's strange; I didn't have to do this on any of the osd/mon/admin machines, and they're running the same OS, the same version of Python, and the same steps to install Ceph. Not sure why the client was such a jerk about it. The only other difference with the client is that it's 32-bit.
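The manual fix was nothing fancy; on EL6 it was something like the following (package names are from memory, so treat them as an assumption):

sudo yum install python-argparse python-setuptools
ceph --version    # re-run the check that was failing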

"ceph-deploy admin ceph-client" ran fine

Back to using Ceph

Well, with the errors getting the client set up fixed, back to trying to set up a block device.

"rbd create TestBD --size 10000" Should create a ~10GB block device, runs fine.
"sudo rbd map TestBD --pool rbd --name client.admin" - should map it, does not run fine; get the following errors
  • ERROR: modinfo: could not find module rbd
  • FATAL: Module rbd not found.
  • rbd: modprobe rbd failed! (256)

What does this mean? Not a clue.

...

Looking through various mail archives and other blog posts, it seems clear I'm missing a certain driver for rbd (RADOS block device). Certain posts suggest that I install ceph-common to get this driver, but "ceph-common" is not a package in the EL6 repo -- apparently I should have done this on Ubuntu, which seems to be what most of this is written for.

So, looking at the Ceph packages I have available to me (assuming the driver is in one of them, which it may not be), I can install: "ceph-deploy", "ceph-devel", "ceph-fuse", "ceph-libcephfs", "ceph-radosgw". The descriptions of these from "yum search ceph" aren't much help. I'm going to try devel and libcephfs first; those sound promising.

 ...

Nope, no help. A yum search for rbd also returns nothing useful.

...

Evidently this is a kernel driver that I didn't compile into my kernel... So that's fun...

Recompiling my Client Kernel again

So, I'm not going to bother updating the kernel on the osd/mon machines, just the client - I don't think the others need it. And in fact, there are a lot of warnings about not using rbd on OSD machines. Whether this means you shouldn't actively use rbd there, or that it is dangerous to have installed at all, isn't clear.

So, I went back to the extracted kernel and ran "make menuconfig". Under "Device Drivers > Block Devices" I found "Rados block device (RBD)". I'm not entirely sure if I should include this or modularize it, mostly because I'm not sure what the difference between the two is. To Google!.... Seems to be the difference between building it into the base kernel (loaded at boot, with no ability to remove it) and loading it after boot via modprobe. I think I'll modularize it, since that seems to be what Ceph is expecting based on the errors.

So now it looks like "<M> Rados block device (RBD)"; time to compile the kernel again weeeeeee...
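Once the new kernel is built and booted, the module should be loadable by hand before retrying the map; a quick sanity check looks like:

sudo modprobe rbd
lsmod | grep rbd    # should list rbd (and its libceph dependency) if the module built correctly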

....

Kernel recompiled, rebooted, tried the "rbd map" command again aaaaaaaaaaaaaaaaaaaaaaand crashed. Sweet. I won't reproduce the entire stack trace here, but the word [libceph] is mentioned over and over.

One possibility, found at this email archive, is that the 3.6.11 kernel is too old. Because you know, THEIR FREAKING OS RECOMMENDATIONS PAGE DOESN'T SAY USE THE LATEST IN 3.6.X OR ANYTHING. Not that I'm bitter.

....

So I downloaded and compiled the latest kernel (3.15.1 at the time of writing), but had some issues. Notably, my network devices did not come up. The compile had issues finding a bunch of modules, so I'm assuming that was the problem. Debating between trying to fix the 3.15 kernel or going to a slightly older one and seeing if that works.

Tried 3.12.22, same problem

....

So this is probably my inexperience with upgrading/compiling my own kernel showing. Apparently I should copy the default CentOS config from the /boot directory to the unpacked kernel directory, rename it to .config, and then use menuconfig to add in the things I want. This means any configurations in the current kernel are carried over. Somehow this happened automatically when I upgraded from 2.6 to 3.6, but isn't happening now.

  •  make clean #Clean up the failed make
  • cp /boot/config-2.6.32-431-17.1.el6.i686 /tmp/linux-3.12.22
  • # May have forgotten to mention, the client is 32-bit because the laptop is super old
  • mv config-2.6.32-431-17.1.el6.i686 .config
  • make menuconfig
  • #Enable rbd driver
  • make
  • make modules_install install

Doing it this way there are only a few "could not find module" errors (aes_generic and mperf to be specific) -- I'm not sure what they are, but hopefully they're not too important.

Booted to 3.12.22, and my network is working now, this is good. Let us see if I can finally map the rbd device.


Sweet baby Jesus I think it worked.

  • sudo rbd map TestBD
  • sudo mkfs.ext4 /dev/rbd1
    #It didn't tell me this is what it mapped it as, just had to look for it
  • sudo mkdir /CephBlock1
  • sudo mount /dev/rbd1 /CephBlock1
  • cd /CephBlock1
  • sudo touch IMadeaBlockDevice.txt

Back to Using Ceph .... Again.


Yep, it appears to be working. Time to test some throughput: just going to do a dd with various block sizes to test the write speed of the drive. Command used:

sudo dd if=/dev/zero of=/CephBlock1/ddspeedtest.txt bs=X count=Y oflag=direct

Vary X and Y to keep the amount of data transferred mostly consistent; oflag=direct should keep it from buffering the writes, giving a better idea of actual drive performance. Also, the laptop (despite being old) and all the nodes have gigabit ethernet cards connected to a gigabit switch -- so this shouldn't be a bottleneck.
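For example, the 4K row in the table below came from a run along these lines:

sudo dd if=/dev/zero of=/CephBlock1/ddspeedtest.txt bs=4k count=10000 oflag=direct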

Ceph Block Device Write:
Speed      Block Size (bs)   Count   Total Data
76 KB/s    4K                10000   41MB
614 KB/s   32K               1250    41MB
1.1 MB/s   64K               625     41MB
2.1 MB/s   128K              156     41MB
4.1 MB/s   256K              156     41MB
6.3 MB/s   512K              78      41MB
7.3 MB/s   1024K             39      41MB
7.9 MB/s   2048K             20      42MB
21.1 MB/s  4096K             10      42MB
31.0 MB/s  8192K             5       42MB
41.9 MB/s  16384K            3       50MB


So there are some numbers, but they don't tell us much without a comparison, so let's run this against one of the drives directly rather than through Ceph.

Not well, it turns out. Like, really not well.

Direct drive write:

Block Size (KB)   Count   Data (MB)   Speed (MB/s)
4                 10000   41          19.4
32                1250    41          53.9
64                625     41          57.4
128               312     41          58.3
256               156     41          51.7
512               78      41          54.9
1024              39      41          58.3
2048              20      42          50.7
4096              10      42          53.5
8192              5       42          57.7
16384             3       50          53.8

Some quick math: that averages to about 20% of the direct speed, with a range of roughly 0.4% to 78%. Running the test a few more times indicates that the non-Ceph numbers are a little more erratic. Except for the 4KB test, which is always lower (around 20 MB/s), the others vary back and forth between ~48 and ~61 MB/s with no apparent pattern. So if we average that out excluding the 4KB case, we're still only looking at maybe 22% on average -- and that assumes this evenly mixed block-size workload. So it's unfortunate that we're looking at such a massive performance hit using the Ceph block device -- even if we assume large-block-size workloads (which may be a pretty big assumption), a ~30% performance impact is significant.
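As a worked example of those ratios: at 4K blocks Ceph managed 0.076 MB/s versus 19.4 MB/s direct (roughly 0.4%), while at 16384K blocks it managed 41.9 MB/s versus 53.8 MB/s (roughly 78%).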

To see if performance continued to scale with block size, I ran a test with bs=1G count=5. The result was 25.7 MB/s, so apparently performance peaks and falls off again at some point. For comparison, the same 5GB all-zeros file wrote at a rate of 57.6 MB/s directly, and transferred (via scp) between two nodes at an average rate of 44.1 MB/s.

Final Thoughts for this Installment

So my initial impressions of using Ceph are not good. It's about six-and-a-half pains in the back to get set up, and once it's set up, performance is suboptimal. I'm going to do a few more posts where I play around with the other functionality of Ceph and test out things like CephFS and the Object Gateway (alternatives to using the block device), and management (how to get manually added daemons into the init.d scripts). I'm also looking to test out failover and high availability to see what happens to data if a node or two goes offline. I'd also like to do some more in-depth performance testing in a more real-world environment, but I'll have to think up a way to do that. It'd be cool to see if I can find out what the bottleneck is; clearly it's not the network or the HDDs, so could it be processing power, memory, or an inherent bottleneck in the software?

These will be saved for another time though, as once again this post has run (length- and time-wise) much longer than anticipated. I've also got a demo of Condusiv's V-locity program coming up soon -- not really a competing product, beyond being about storage/IO -- so I may look at doing a "my experience with" post on that as well, as long as the reps I'm working with give me the OK. 'Til next time.

PS: Let me know if there are any flaws in the way I tested the storage here. I know it's not exactly scientific or robust, but as far as I can tell it's not a bad first-impressions type of test.