Tuesday, October 21, 2014

PowerShell: Managing User Profiles on Remote Machines

 Introduction

If you're like me, you're a charismatic force of nature whom everyone loves unconditionally. However, if your job is like mine, you may often find yourself wanting to remotely remove a user's profile cache from a machine.

A profile cache is something a roaming domain profile creates when a user first logs into a machine. It holds the user's "CURRENT_USER" registry hive (ntuser.dat) as well as any profile folders that are not being redirected. If your domain is set up to download redirected folders (offline file sync), it stores those as well.
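
If you're curious what's cached on a given machine, each local profile also gets an entry under the ProfileList registry key. Here's a quick local sketch (purely illustrative, not part of the functions below):

    Get-ChildItem 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\ProfileList' |
        ForEach-Object { $_.GetValue('ProfileImagePath') }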

Why would you want to remove that? Various reasons. My usual use case is that one of my terminal servers (RDS) is running out of disk space. We have upwards of several thousand students who could potentially use our servers, but usually only a few hundred will use them in a given week. Provisioning enough space to hold several thousand user profiles (several TB) would be too expensive, and with proper profile management it would never be necessary. Other reasons to remove local caches include: a corrupt profile, old/bad settings stored in AppData, long logins or other login problems, and possibly security (but only if password caching is enabled).

So how do we clear them? Windows has two native ways to do this: manually through the System Properties interface, or through Group Policy using the "remove profiles older than x days" GPO. The first method works, but is limited; you must be logged into the machine (locally or via RDP), you can only delete one profile at a time -- at several seconds per profile, removing a large number takes a very long time -- and the "manage user profiles" window itself can take ages (30 minutes or more) to open when there are hundreds of profiles on the machine. The GPO method is a bit better, but has two major caveats. First, the machine must be rebooted for the purge to run -- so it cannot be done on demand while the system is in use. Second, it cannot target specific accounts.

So we have a situation in which no real tool exists to do what we want. DelProf2 is one option I've looked at before, and while I'm sure it's a perfectly functional tool, as a rule I don't like running software to automate tasks when I don't know exactly what it's doing (from its changelog it appears to be a very manual process). So, from the shortcomings of the methods above, we want the following features:
  • Ability to delete multiple profiles quickly
  • No reboot required
  • Command line for batching/automation
  • Ability to target specific profiles, or delete all older profiles
  • Fully remove all registry info and files of user's account
  • Do not have to be logged in / can be done remotely
Turns out all of this can be done via PowerShell and WMI.

 Enter PowerShell

First things first, here is the full code for the three different functions so you can follow along. They should be pretty easy to read; they're commented and have full Get-Help integration (except the first one, because it's very basic).


edit (7/17/15): Moved code to a local page. There was some issue with pastebin.
http://bisbd.blogspot.com/2015/07/code-dump-get-userprofile-remove.html

The last one I've already talked about in a previous post. It's a simple function to convert UTC time strings into a DateTime object that PowerShell understands.

Let's look at the next one then.

Get-UserProfile

I'm going to skip over the get-help information, as doing so would be redundant. So the first bit of code is this:

    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$False)][string]$UserID="%",
        [Parameter(Mandatory=$False)][string]$Computer="LocalHost",
        [Parameter(Mandatory=$False)][switch]$ExcludeSystemAccounts,
        [Parameter(Mandatory=$False)][switch]$OnlyLoaded,
        [Parameter(Mandatory=$False)][switch]$ExcludeLoaded,
        [Parameter(Mandatory=$False)][datetime]$OlderThan
    )

These are our cmdlet bindings. They allow us to pass parameters cleanly to the function. Notice nothing here is mandatory: if no parameters are passed, the function defaults to returning all profiles on the local system. The UserID and Computer parameters are given default values if none is specified ('%' is WMI-speak for "all" -- analogous to '*' or '.*' in most regex systems).
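
As a quick illustration of how the defaults behave (assuming the functions are loaded; 'TS01' is a placeholder computer name):

    Get-UserProfile                            #all profiles on the local machine
    Get-UserProfile -UserID bob                #just bob's profile, locally
    Get-UserProfile -UserID bob -Computer TS01 #bob's profile on a remote machine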

Next:

if(!(Get-Command Convert-UTCtoDateTime -ErrorAction SilentlyContinue)){
    write-host -BackgroundColor "Black" -ForegroundColor "Red" "################################################################################"
    write-host -BackgroundColor "Black" -ForegroundColor "Red" "#                                                                               "
    write-host -BackgroundColor "Black" -ForegroundColor "Red" "This Program Requires cmdlet ""Convert-UTCtoDateTime""                          "
    write-host -BackgroundColor "Black" -ForegroundColor "Red" "Find it here:                                                                   "
    write-host -BackgroundColor "Black" -ForegroundColor "Red" "http://bisbd.blogspot.com/2014/10/adventures-in-powershell-converting-utc.html  "
    write-host -BackgroundColor "Black" -ForegroundColor "Red" "#                                                                               "
    write-host -BackgroundColor "Black" -ForegroundColor "Red" "################################################################################"
    break;
}

Here we check to make sure our dependent function is loaded. If not, it displays this lovely warning with a link.
Next:
if($Computer.ToLower() -eq "localhost"){
    $Return = Get-WmiObject -Query "Select * from win32_userprofile where LocalPath like '%\\$UserID'"
}
else{
    $Return = Get-WmiObject -ComputerName $Computer -Query "Select * from win32_userprofile where LocalPath like '%\\$UserID'"
}

OK, first real code. Here we do a quick check to see whether we're running against the localhost or a remote machine. The WMI query we run is the same for both, but the parameters we pass to Get-WmiObject are slightly different.

On my machine, passing -ComputerName $Computer actually works with 'localhost', but only because 'localhost' is defined as 127.0.0.1 in the default hosts file. That might be a safe assumption on any Windows system, but I try not to make assumptions when I can avoid it.

About the WMI query: for anyone familiar with SQL this will probably look pretty familiar, though the terminology is a bit different. We're selecting everything from the class "win32_userprofile" where the property "localpath" matches our UserID with a backslash in front of it. The backslash is there to prevent UserIDs that are substrings of another UserID from returning erroneous results. For example: if you had users 'bob' and 'jimbob', searching for 'bob' would return both bob and jimbob without the slash in front.

Here's a command you can run to see everything returned by this query.

Get-WmiObject -Query "Select * from win32_userprofile"

What you'll see, if this runs correctly, is a mess. There are only a few properties in here that are useful (which we'll filter down to in a minute). A noticeable exclusion is that there is no username-type field. I point this out because looking at the LocalPath property might otherwise seem strange.
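
To cut through the noise, you can pick out just the handful of useful properties (the same ones the function cares about):

    Get-WmiObject -Query "Select * from win32_userprofile" |
        Select-Object LocalPath, SID, Loaded, Special, LastUseTime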

So after this block of code we have a full list of user profiles stored in our "$Return" variable. Now it's time for some filtering.

#Filter System Accounts
if($ExcludeSystemAccounts){
    $Return = $Return | Where-Object -Property Special -eq $False
}
#Filter out Loaded Accounts
if($ExcludeLoaded){
    $Return = $Return | Where-Object -Property Loaded -eq $False
}
#Filter out everything except loaded accounts
if($OnlyLoaded){
    $Return = $Return | Where-Object -Property Loaded -eq $True
}


Here are the first three filters. They're all pretty much the same: first they check whether their switch has been set, then use Where-Object to filter on the relevant property. I do these all as individual if statements that modify the $Return variable so that they can be chained together.

The two properties we're looking at here are "Special" and "Loaded". The Special property tells us whether an account is a non-user (i.e., system) account. You'll see things like "system", "network service", etc. listed as special. The "Loaded" property tells us whether the account is currently in use. This property will be important later, as you can't remove accounts that are currently loaded.

My inclusion of an "OnlyLoaded" flag might seem strange here. It's not directly related to the removal of user accounts, but adds extra functionality: combine "-OnlyLoaded" and "-ExcludeSystemAccounts" and you can find out which user(s) are logged into the machine. Neat!
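
For example (again, 'TS01' being a placeholder computer name):

    #Who is logged into TS01 right now?
    Get-UserProfile -Computer TS01 -OnlyLoaded -ExcludeSystemAccounts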

Let's look at the last filter now.
#Filter on LastUseTime
if([bool]$OlderThan){
    $Return | Where-Object -property LastUseTime -eq $Null | % {Write-Host -BackgroundColor "Black" -ForegroundColor "Yellow" $_.LocalPath " Has no 'LastUseTime', omitting" }
    $Return = $Return | Where-Object -property LastUseTime -ne $Null
    $Return = $Return | Where-Object {$(Convert-UTCtoDateTime $_.LastUseTime -ToLocal) -lt $OlderThan }
}

This one has a bit more going on.

The if statement looks a bit different. Because the variable is a System.DateTime object rather than a boolean, I'm typecasting it to a boolean. If the variable has been populated, this returns true; if the variable is $Null (that is, was never set), it returns false. The typecast isn't strictly necessary -- simply doing "if($OlderThan)" would return the same thing -- it's mostly just for readability.
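
A quick demonstration of the cast, if you want to see it for yourself:

    [bool]$null       #False -- the parameter was never set
    [bool](Get-Date)  #True  -- any populated datetime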

The next lines warn the user that it's skipping over any user accounts with a $Null "LastUseTime" property. This is one aspect that may need to be modified in the future, but I don't think so: I have never seen a $Null LastUseTime on an actual user account. Mostly it shows up on accounts created by programs. For example, my computer has ".Net v4.5 Classic", "DefaultAppPool", and ".Net v4.5" as accounts with no LastUseTime. Even local users who have been created but never logged in won't get caught by this, because they don't show up at all until their first logon, at which point they get a LastUseTime.

Finally, after filtering out the $Null entries, we convert the LastUseTime to a DateTime object and compare it to the DateTime passed to the -OlderThan parameter. By default the LastUseTime is a very ugly string that is difficult to make sense of at a glance. More importantly, to do any sort of date math, PowerShell needs it as a DateTime object. This is where the Convert-UTCtoDateTime function comes into play: it takes the ugly UTC string and turns it into something PowerShell can understand.

One caveat here: the -ToLocal flag turns out to be important. When doing date math, PowerShell evidently doesn't take time zones into consideration, so it's necessary to have both dates in local time before doing math, otherwise it might not behave as expected. Here's a rough sketch of the kind of mismatch you can hit (exact results depend on your machine's UTC offset):
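
    $raw = "20141015160319.191919+000"            #a sample WMI time string
    $utc   = Convert-UTCtoDateTime $raw           #UTC wall time
    $local = Convert-UTCtoDateTime $raw -ToLocal  #shifted by your UTC offset

    #Comparing the raw UTC value against a local datetime can be off by
    #your entire UTC offset, so these two can disagree:
    $utc -lt (Get-Date)
    $local -lt (Get-Date)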



Next, and final block:

if($PSBoundParameters['Verbose']){
    Write-Output $Return
}
else{
    Write-Output $Return | Select SID,LocalPath,@{Label="Last Use Time";Expression={Convert-UTCtoDateTime $_.LastUseTime -ToLocal}}
}


Here we process our output. I've added support for PowerShell's verbose flag. -Verbose is a native flag in PowerShell which all cmdlets have, even if they don't implement it. Normally it's used with the Write-Verbose cmdlet, but I've done a little more: I didn't just want to write something additional when verbose is set, I wanted the output formatted differently -- that is, unformatted. This is necessary for this cmdlet's integration with the Remove-UserProfile cmdlet. So I check whether verbose has been set; if it has, I simply return the $Return variable. If it's not set, I format the output to look nice and show only the relevant information.

The relevant information here is the SID (the profile's unique identifier), the LocalPath (e.g., c:\users\myuser), and the LastUseTime. The LastUseTime I run through Convert-UTCtoDateTime to make it look nicer and be more useful at a glance.

That's about all there is to Get-UserProfile. Next we'll look at Remove-UserProfile, which uses Get-UserProfile.

Remove-UserProfile

Remove-UserProfile is very much an extension of Get-UserProfile. At a high level, it uses Get-UserProfile to obtain a list of user profiles, then deletes them. That's really about it. Obviously there are a few checks and things in here as well, so let's go through them.

$ProfileList = Get-UserProfile -Verbose -UserID $UserID -Computer $Computer -ExcludeSystemAccounts -OlderThan $OlderThan

Since we've already done all the big work in the Get-UserProfile cmdlet, all we need to do is call it with the appropriate flags. We use -Verbose so we get the full object, not just the filtered information. We exclude system accounts because we don't want to delete those, for what I hope are obvious reasons -- I'm not sure it'd actually let you, but better safe than sorry. We also use the -OlderThan flag regardless of whether the user has actually specified it.

Looking back at the parameter bindings, you'll see I've included a default value for $OlderThan that is one day in the future. This is for a couple of reasons. First, it's way more readable: no nested if statements with different queries. Second, it filters out the not-system-but-also-not-user accounts. I haven't tried removing those accounts to see what would actually happen, but I'm sure .NET would be none too happy about it.
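
That binding looks something like this (a sketch; see the code dump linked above for the real thing):

    [Parameter(Mandatory=$False)][datetime]$OlderThan = (Get-Date).AddDays(1)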

Next block

    if(!$ProfileList){
        Write-Warning "NO USER PROFILES WERE FOUND"
        RETURN;
    }

This is a simple $Null check to make sure the query actually returned something. If no profiles matched the criteria, the script exits.


    if(!$Batch){
        Write-Warning "ABOUT TO REMOVE THE FOLLOWING USER ACCOUNTS"
        Foreach($User in $ProfileList){
            $User | Select SID,LocalPath,@{Label="Last Use Time";Expression={Convert-UTCtoDateTime $_.LastUseTime -ToLocal}}
        }
        $Title = "PROCEED?"
        $Message = "ARE YOU SURE YOU WANT TO REMOVE THE LISTED USER ACCOUNTS?"
        $Yes = New-Object System.Management.Automation.Host.ChoiceDescription "&Yes","Removes User Accounts"
        $No = New-Object System.Management.Automation.Host.ChoiceDescription "&No","Exits Script, No Changes Will be Made"
        $options = [System.Management.Automation.Host.ChoiceDescription[]]($yes, $no)
        $result = $host.ui.PromptForChoice($title, $message, $options, 1) 
        switch ($result)
        {
            0 {}
            1 {return;}
        }

    }

The next bit here is a confirmation dialog. This is a built-in PowerShell feature you can read more about here, but a few quick notes about my implementation. First, if the -Batch flag is set, it skips this. This is important, as otherwise the script would always require user confirmation, which would make it far less useful for automation.

The foreach loop here lists out (in a nice format) all the user profiles to be deleted. This is a nice sanity check for the user to make sure they know what they're deleting.

On the choices, $Yes/0 does nothing, and $No/1 exits the script, with the default being No. I wrote it this way to make the coding easier. With this continue/exit approach, the rest of the cmdlet doesn't have to be embedded within the "switch($result)" block, which makes the code much more readable and the -Batch code easier to write.


    Foreach($User in $ProfileList){
        if($User.Loaded){
            if(!$Batch){
                Write-Host -BackgroundColor "Black" -ForegroundColor "Red" "User Account " $User.LocalPath "is Currently in use on" $Computer ":`tSkipping"
            }
            else{
                Write-Output "User $($User.LocalPath) on $($Computer) was in use and could not be removed"
            }
            continue;
        }
        if(!$Batch){
            Write-Host -BackgroundColor "Blue" -ForegroundColor "Green" "Removing User $($User.LocalPath) from $($Computer)"
        }
        else{
            Echo "Deleting $($User.LocalPath) from $($Computer)"
        }
        $User.delete()
    }

Now we get into the actual deleting: a simple foreach loop that deletes everything returned by the Get-UserProfile cmdlet. A few things to look at in here. I use if(!$Batch) in a couple of places, for formatting reasons. The only difference between the batch and non-batch output is the method of writing. Batch uses Write-Output (aka echo), which is nice because it can be redirected to a log file; however, Write-Output lacks formatting options. So in non-batch mode I use Write-Host, which cannot be redirected to a file but gives us formatting/coloring options to make the output more readable.

Next, let's look at the if($User.Loaded). As discussed in Get-UserProfile, the Loaded property tells us whether or not the profile is currently in use. It's important to filter these out, otherwise PowerShell will throw errors when you try to delete the profile. Why not use the -ExcludeLoaded flag we created in Get-UserProfile? I debated this for a while, but decided it would be frustrating if you were trying to delete a specific profile and the script kept saying "no profiles found". This way provides more information, even if it wastes a bit more time.

And lastly, we delete the profile. "$User.Delete()" is really all it takes.

These three functions are about 250 lines all together. And you could get nearly the same functionality with this one-liner:

  Get-WmiObject -Computer MyComputer.mydomain -Query "Select * from win32_userprofile where LocalPath like '%\\MyUser'" | % {$_.Delete()}

I'm not entirely sure what my aim is in pointing this out. Maybe that there's a tradeoff between writing something that works for you, and something other people could actually use?

Anyway, hope someone else can get some use out of this. I know it's something that's bugged me for a long time. 

Wednesday, October 15, 2014

Adventures in PowerShell: Converting UTC to DateTime

Intro

Ran into a problem recently. I'm working on a script (more on this later) that will let me pull a list of user profiles on a remote machine. The problem is that the user profile's "last use time" -- a bit of information I would like to have -- is a UTC-formatted System.String object, meaning it looks like this:

20141015160319.191919+000
Pretty standard stuff, except PowerShell doesn't know what to do with it. It blew my mind when I learned this: PowerShell doesn't have a native function to convert this type of string to a date (that is, a 'System.DateTime' object). After an hour of Googling I couldn't find a good go-to script to handle it (at least, not one in PowerShell -- several in VB or C#). So I wrote my own. It's a pretty simple script; it relies on the string being in the specific format above, that is:

yyyyMMddhhmmss.ffffffzzz
I may update this script if I find additional formats -- but this being a standard, it should work in most places.

One note: while the format specifies six decimal places, Windows can only handle three, so the trailing three are dropped.

The '-ToLocal' flag will convert the result to your local time zone.


The Code:

function Convert-UTCtoDateTime{
<#

  Author: Keith Ballou
  Date: 10/15/14

#>

    #Parameter Binding
    [CmdletBinding()]
    param(
        [Parameter(Mandatory=$True,Position=1)][string]$UTC,
        [Parameter(Mandatory=$false)][switch]$ToLocal
        )

    #Break out the various portions of the time with substring.
    #This is very inelegant, but the UTC string format is fixed-width, so it works.
    $yyyy = $UTC.substring(0,4)
    $M = $UTC.substring(4,2)
    $dd = $UTC.substring(6,2)
    $hh = $UTC.substring(8,2)
    $mm = $UTC.substring(10,2)
    $ss = $UTC.substring(12,2)
    $fff = $UTC.substring(15,3)
    $zzz = $UTC.substring(22,3) #the offset digits (captured but not used below)

    #If local, add the UTC offset returned by get-date
    if($ToLocal){
    (get-date -Year $yyyy -Month $M -Day $dd -Hour $hh -Minute $mm -Second $ss -Millisecond $fff) + (get-date -format "zzz")
    }
    #else just return the UTC time
    else{
    get-date -Year $yyyy -Month $M -Day $dd -Hour $hh -Minute $mm -Second $ss -Millisecond $fff
    }
}
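
A quick usage sketch (display formatting follows your locale; the -ToLocal result depends on your time zone):

    Convert-UTCtoDateTime "20141015160319.191919+000"
    #=> Wednesday, October 15, 2014 4:03:19 PM (the UTC wall time)

    Convert-UTCtoDateTime "20141015160319.191919+000" -ToLocal
    #=> the same moment, shifted by your machine's UTC offset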

Tuesday, September 30, 2014

More Printing Problems - Spooler and Citrix Print Manager Crash

Solution 2

Most of Solution 1 below still applies. I've found that a large percentage of crashes/hangs can be avoided by making sure there are no old drivers on the terminal server or the print server. You can check out a separate, related post here for details.

In short, the spooler service will load drivers even if no mapped/installed printer is using them. This can cause old, out-of-date drivers to crash/hang the spooler even when they are not in use.

Again, the problem is not 100% fixed, but is much better after clearing out drivers.

Solution

Today the solution appears to be: use the HP UPD (version 5.8, for the record), recreate the print queues (the fancy name for shared printers), and have a script that handles failures elegantly.

Use of the HP Universal Print Driver is pretty much mandatory in a terminal server environment (check out the HP compatibility list here (pdf)). Most HP printers, new or old, do not support the use of their device-specific driver in a Citrix Xen* environment.

Recreating the print queues apparently helps for not-quite-adequately-explained reasons. Apparently switching which driver a print queue uses can cause some sort of corruption that can crash the spooler, so you have to delete and re-add the shared printer (you can use the same port, etc.), starting with the UPD driver rather than using the device-specific one and changing later. I've done this for a couple of printers and it has already drastically reduced the number of crashes. I'll be recreating other printers that seem to be causing problems and will update this with progress.

Having a script that handles failures nicely is key to reducing user impact. Printers are always problematic, especially in a Xen/RDS environment, so expecting zero crashes is probably over-optimistic. I've written a script that runs when a service is detected to have failed (configured through the 'Recovery' tab in the service properties). The script restarts both the spooler and the Citrix Print Manager service. For some reason, if these two are not restarted together, they don't seem to talk to each other, so whenever one fails, both need to be restarted. I'll put the script at the bottom of the article.
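
For reference, the same recovery action can also be wired up from an elevated command prompt with sc.exe (the script directory here is a placeholder; note sc requires the space after each '='):

    sc failure spooler reset= 86400 actions= run/0 command= "powershell.exe -ExecutionPolicy Bypass -File C:\Scripts\RestartPrintServices.ps1"
    sc failure cpsvc reset= 86400 actions= run/0 command= "powershell.exe -ExecutionPolicy Bypass -File C:\Scripts\RestartPrintServices.ps1"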


Introduction


We've had more problems with printing since fixing the issue with the Citrix Print Client. The issue now is that the print spooler on our terminal server keeps crashing and it causes people not to be able to print, printers not to map, etc.

In brief, there are two services: "Citrix Print Management" (CpSvc) and "Print Spooler" (spooler). Even though we are no longer using the Citrix Print Client, we still need the CpSvc service because it handles the mapping of printers through Citrix group policy, which gives us some additional functionality that would be difficult to replicate in normal AD group policy. Anyway, when either of those services crashes it breaks everything, and simply setting the service to restart on crash doesn't work either. The processes must be restarted together, otherwise they don't seem to talk to each other.

I've written a small script that restarts both services whenever one fails, which minimizes the impact of a failure, but I'm still working on solving the underlying problem. So it's time for another long rambling post trying to figure out what's happening. The last one went pretty well, so let's give it a shot.

Environment

XenDesktop Controller: Server 2008r2SP1,  XenDesktop 7.0, Physical Server with way more resources than necessary
XenDesktop Hosted Desktop: Server 2008r2SP1, runs RDS server/ XenDesktop 7.1, clients connect from Wyse Xenith2 thin clients, about 100 possible clients, but generally have only 30-50 at any given time.
Print Server: Server 2008R2SP1, Microsoft Print Server

Errors in Event Viewer

Here's a brief rundown of the various errors I've gotten and what I've been able to find out about each.

splwow64.exe crash

Type: Error
Source: Application Error
EventID: 1000
Task Category: (100)
Message:
Faulting application name: splwow64.exe, version: 6.1.7601.17777, time stamp: 0x4f35fbfe
Faulting module name: ntdll.dll, version: 6.1.7601.18247, time stamp: 0x521eaf24
Exception code: 0xc0000374
Fault offset: 0x00000000000c4102
Faulting process id: 0xeb30
Faulting application start time: 0x01cfd83a3b2e4f74
Faulting application path: C:\Windows\splwow64.exe
Faulting module path: C:\Windows\SYSTEM32\ntdll.dll

splwow64.exe is a process that translates x64 print drivers for use by 32-bit applications. For example: all of our print drivers must be 64-bit because the server runs 2008R2SP1, but the server runs 32-bit Office (for various plug-in compatibility). When Office wants to print, it has to go through splwow64.exe because it wouldn't know what to do with a 64-bit driver.

As for why this crashes, I have no idea. You see the "faulting module" is one "ntdll.dll", and the exception code is "0xc0000374". ntdll.dll is explained here in Wikipedia; I'd try to summarize, but since my understanding is vague at best, it's probably best to read it yourself. "0xc0000374" is an error code that indicates heap corruption, which is a fancy way of saying the memory was modified in a way that wasn't expected. Neither of these bits of information is particularly insightful, but they come up over and over in these errors.


spoolsv.exe crash

Type: Error
Source: Application Error
EventID: 1000
Task Category: (100)
Message: 
Faulting application name: spoolsv.exe, version: 6.1.7601.17777, time stamp: 0x4f35fc1d
Faulting module name: ntdll.dll, version: 6.1.7601.18247, time stamp: 0x521eaf24
Exception code: 0xc0000374 OR 0xc0000005
Fault offset: 0x00000000000c4102
Faulting process id: 0x9044
Faulting application start time: 0x01cfd8232b50fcd7
Faulting application path: C:\Windows\System32\spoolsv.exe
Faulting module path: C:\Windows\SYSTEM32\ntdll.dll
This error is very similar to the splwow64.exe one above, with the exception of the "0xc0000005" variant, which, as far as I can tell, is a memory access violation.

Couldn't Load Print Processor

Type: Error
Source: PrintService
EventID: 365
Task Category: Initializing a print Processor
Message:
Windows could not load print processor hpcpp160 because EnumDatatypes failed. Error code 126. Module: 18\hpcpp160.dll. Please obtain and install a new version of the driver from the manufacturer (if available), or choose an alternate driver that works with this print device.

This error is a bit more informative. There are variants for other print processors (e.g. hpzppwn7); hpcpp160 happens to be the HP Universal Print Driver (version 5.8). Anyway, it's our first indication of something wrong with the print service. The problem is, most of the time this print processor works without issue: most of our printers use the HP UPD, and they work most of the time.

I've also tried reinstalling the UPD (on the client; doing it on the server would require more extended downtime). This hasn't had an effect.

Citrix -- "Environment is incorrect" "no printers were found" "printer auto-creation failed"

Type: Error
EventID: 1114 / 1116

I'm not going to list out the full text of these because they are erroneous (at least as far as my problem goes) -- see this forum for more info. Basically, these errors can be logged even if printer creation is succeeding, which is very obnoxious. It's possible they're logged because printers are not being deleted at logoff, but I haven't found anything to suggest that is still an issue (that forum post is pretty old).

Print Spooler Can't copy file

note: you must enable the PrintService operational log to see this error. In Event Viewer, find it under Applications and Services > Microsoft > Windows > PrintService, right-click the Operational log, and select "Enable Log".
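
The same log can be enabled from an elevated prompt, if you prefer:

    wevtutil sl "Microsoft-Windows-PrintService/Operational" /e:true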

Type: Error
Source: PrintService
EventID: 811
Task Category: Executing a file operation
Message
 The print spooler failed to move the file C:\Windows\system32\spool\PRTPROCS\x64\hpcpp160.dll to C:\Windows\system32\spool\PRTPROCS\x64\202_hpcpp160.dll, error code 0xb7. See the event user data for context information.

This message again may vary with the print processor (replace hpcpp160.dll with whatever.dll). This is an odd message: the folder it references is full of duplicate print processor DLLs (1_hpcpp160.dll through 499_hpcpp160.dll). I have no idea why; this is the current lead I'm working on.

Things I've Tried

  1. Create Script that restarts processes
    1. First did this restarting "spooler" and "cpsvc" every 5 minutes
      1. this technically worked, but caused some strange behavior and is overly inelegant
    2. Set "spooler" and "cpsvc" to run the script when either crashes
      1. this can be done in the services MMC snap in.
      2. Still doesn't solve the underlying problem, but is a nice band-aid fix until I can figure the bigger issue out
      3. it's also way more elegant than the "restart every 5 minutes" solution.
      4. Note: had to change CpSvc to log on as a local service with permission to interact with desktop (was just local service), otherwise the script wouldn't run correctly when it failed.
  2. Moving printers to the HP UPD
    1. Thought here is that one of the device-specific print drivers wasn't terminal compatible
    2. This hasn't exactly panned out. Moved device-specific printers to the UPD but errors continue to show up. I've just finished this migration recently, so maybe it'll pay off over time.
  3. Clearing out system32\spool\prtproc\x64
    1. This folder is full of duplicate .dll files (see "print spooler can't copy file" above) 
    2. Found out not to delete everything from that folder. The WinPrint.dll file will not recreate itself.
    3. Print spooler still crashed (0xc0000005), at like 5am when no one should have been using it. So that's fun.
      1. Actually, someone did log in at 5am, just seconds before the crash. So that's something to go on, maybe.
    4. Printing hasn't gotten any worse, so at least I haven't broken anything
    5. This at least seems to have stopped the 811 errors. Watching to see if the prtproc folder starts to build up again.
  4. Recreating Print Queue
    1. Some things I read suggested that print queues may become corrupt when switching drivers/print processors.
      1. Print queue is the technical term for a shared printer
    2. So I deleted and recreated the print queues that seemed to be causing the most issues
      1. "most issues" was determined by cross referencing the crash time with our user tracking to determine which stations (and thus which printers) were most recently logged into before the crash.
    3. This actually appears to have had some effect: I've only had one crash today (and it was the splwow64.exe crash, not the spooler or CpSvc).
      1. Today's Friday, so load is low, but will continue to monitor.

Days pass....
Spooler/CpSvc/splwow64 continue to crash, but much less frequently -- on average maybe once or twice a day, much lower than the every-couple-of-hours it used to be. I am going to continue recreating print queues to see if I can eliminate the crashing altogether, and will update this page as I learn more.

RestartPrintServices.ps1

write-host "Shutting down Citrix Print Manager"
stop-service -force cpsvc
write-host "Waiting for CpSvc to shut down gracefully" -nonewline
$count=0
while($(Get-Service cpsvc).Status -ne "Stopped")
{
    $count++;
    if($count -gt 5)
    {
        write-host ""
        write-host "CpSvc has not shut down gracefully, shutting down manually"
        stop-process -force -Name cpsvc
        break;
    }
    write-host "." -nonewline
    Start-Sleep 1
}
write-host ""

write-host "Shutting down Print Spooler"
stop-service -force spooler
write-host "Waiting for Spooler to shut down gracefully" -nonewline
$count=0
while($(Get-Service spooler).Status -ne "Stopped")
{
    $count++;
    if($count -gt 5)
    {
        write-host ""
        write-host "Spooler has not shut down gracefully, shutting down manually"
        stop-process -force -Name spoolsv
        break;
    }
    write-host "." -nonewline
    Start-Sleep 1
}
write-host ""

write-host "Bringing Spooler back up"
start-service spooler

write-host "Bringing Citrix Print Manager back up"
start-service cpsvc

#log the restart time
date >> c:\temp\restartprinters.txt

 

Wednesday, September 10, 2014

Printing issues on Win 7 Virtual machines with XenDesktop -- Citrix Universal Print Client

Solution

The fix for this ends up being to remove the "Citrix Universal Print Client" from the XenDesktop clients. According to certain sources, this happens when the UPC is attempting to contact the Universal Print Server even when a Universal Print Driver isn't being used. The server obviously doesn't respond, and so there's a considerable timeout before it falls back to Windows printing.

I'm skeptical that this is actually what's happening, at least in my environment, for a couple of reasons.
  1. The timeout occurs even when using the UPS/UPD/UPC.
  2. The timeout occurs for some printers/drivers and not for others

The only way I've found to remove the UPC is to reinstall the XenDesktop Client without the UPC. From command line it looks something like:

#note: reboot before running the first command, and between each command. If you don't specify the "/noreboot" flag, it will reboot automatically after each command.
#<XenDesktopDir> = unzipped iso location
#<XDInstaller> = <XenDesktopDir>\x86\XenDesktop Setup\XenDesktopVdaSetup.exe
#   or for 64-bit = <XenDesktopDir>\x64\XenDesktop Setup\XenDesktopVdaSetup.exe

#Remove Current install
<XDInstaller> /quiet /removeall

#Reinstall Without UPC
#You can check here for the flags you need for any other customizations you make
<XDInstaller> /quiet /Components vda /EXCLUDE "Citrix Universal Print Client" /logpath "c:\temp\xdinstalllogs\"

#Configure Controller hostname/port
#If you set this through group policy, you shouldn't need this step
<XDInstaller> /quiet /reconfigure /controllers "mycontroller.mydomain.com" /portnumber 9999

 

Introduction

Been having some issues with printing from our thin clients. The most common symptom is that whatever program is trying to print (Word, Notepad, browser -- it appears to affect all of them about the same) locks up for 30-60 seconds. During this time a window saying "connecting to printer" may be present (though not always), and the main program window appears unresponsive (not responding in Task Manager).

I'm writing this as I troubleshoot, so apologies if it's a bit schizophrenic.

Setup

Client: Windows 7 x86 - fully updated -- mostly, some are not 100% updated, but all have at least SP1, and it doesn't appear to make much of a difference. Clients have XenDesktop Client installed (7.0/7.1 -- I end up trying 7.5 as well)

HyperVisor: XenServer 6.2.0 (fully updated, SP1 + a couple more updates). Clients all have XenTools installed.

Print Server: Server 2008R2 (VM) x64. Has "Print Manager Plus" software, also has Citrix Universal Print Server.

New Printer Server: Server 2012R2 (VM) x64. Does not have Print Manager Plus.

Drivers: To be discussed

A More Detailed Description of the Problem

My general view of the problem is that the printer is taking a long time to respond to the program trying to print to it. The main way this appears is when you click "print" in a program: it tries to contact the printer to get availability, status, capabilities, etc. When the problem occurs, this takes a long time to finish, which means the program stops working for 30-60 seconds. The window will show "not responding". Sometimes a box saying "connecting to printer" will appear, but not always.

This problem will happen several times when trying to print, because the computer appears to talk out to the printer several times: first when you go to the print menu to select a printer, again when you click on the printer itself (to select it), then again when you click print to actually send the job. This means there can be a delay of several minutes for a user trying to print a basic document, which is understandably very annoying.

What I've tried so far

My first thought was that it had something to do with the Citrix Universal Print Server; that's why I built up the 2012R2 server. However, the problem is present when mapping printers from that server as well. It only happens with some printers; others work just fine.

Mapping the printers directly (via IP on the client) appears to work fine as well.

Most of our printers are HP, and thus most use the HP Universal Print Driver. This is currently my main suspect, but it's unclear as to why that would cause such a problem with only VMs. Physical Clients (your standard desktop machines) do not have this problem.

I also thought it might be the universal print manager (the client side of the Citrix Universal Print Server), but I have disabled that service and deleted the Citrix Universal Print Driver from a machine and the problem persists. It's possible that the Citrix software still causes some problems; I'm currently installing a vanilla Windows 7 machine to test this theory.

Problem occurs whether connected over XenDesktop, Remote Desktop, or Directly through XenCenter Console.

Being an administrator or not does not appear to have an effect.

Using the FQDN or IP of the server rather than the hostname doesn't appear to have an effect.

I've tried Type 3 and Type 4 drivers (What does that mean?). Both work fine from the print server (that is, the print server never has problems printing test pages). Type 4 drivers are not technically supported on Windows 7, so when a Windows 7 machine tries to connect to a Type 4 printer, it is given an "enhanced Point and Print compatibility driver". These work fine; however, this is not an apples-to-apples comparison, because there is no Type 4 HP Universal Print Driver. So the Type 3 not working where the Type 4 does is as much comparing device-specific to universal as it is Type 3 to Type 4 (still trying to find Type 3 device-specific drivers). But for what it's worth, the HP UPD (Type 3) has the problem, and the device-specific (Type 4) drivers do not.

Older versions of the HP UPD appear to have the same problem, and switching between PCL 5, PCL 6, and PS makes no difference.

For some reason HP doesn't always have device-specific drivers on their website -- they'll just link the UPD. So it's really hard to find a printer that has the UPD, a Type 3 device-specific driver, and a Type 4 device-specific driver to do some real testing on. Testing the difference between Type 3 device-specific and Type 3 UPD at least... the Type 3 device-specific driver has the problem as well, at least on my "HP LaserJet 400 M401".

The problem also occurs on one of my terminal servers (2008R2), which is also running the XenDesktop client but is a physical server.

The problem does not appear to occur on another terminal server (2008R2) which is a virtual machine but is not running the XenDesktop client. Doing further testing to verify... For all intents and purposes, the problem does not exist on this machine. You can see the "connecting to printer..." dialog come up for a split second, but it's almost as fast as a physical machine -- not enough to make a difference user-experience-wise. So it looks like we may be looking at the XenDesktop software as the culprit. It's strange that it only has problems with certain (mostly HP) drivers.

As I mentioned, I've also tried using the Citrix UPD, but it shows the same problem. I'd have to do more testing to verify whether the CUPD locks up only when printing to printers that otherwise exhibit those symptoms, or whether the CUPD is just broken in general.

For now, I'm getting a clean windows 7 machine built up and will install software one-by-one to determine when the problem starts happening.

....

OK, a fully updated Windows 7 machine with nothing else on it is ready. My method here is:
  1. Map the printer
  2. Wait a minute
  3. Open notepad
  4. Try to print
  5. Wait another minute
  6. Open Printer properties
  7. Unmap Printer
  8. Reboot after each full test (between changing variables, not between each printer)
If either step 4 or step 6 takes more than about 10 seconds, I'll consider the problem to exist. I try this with three different printers, all of which have exhibited the problem in the past. Here are the variables and results of the tests.

  1. Clean machine
    1. Problem does not exist
  2. Install XenTools
    1. Problem does not exist
  3. Domain Join
    1. Problem does not exist
  4. Move to Correct OU
    1. Problem does not exist
  5. Install XenDesktop Client (VDA)
    1. Problem definitely exists.
So I guess that settles it: it's something in the XenDesktop VDA that breaks printing. Keep in mind I did not run any of the XenDesktop optimizations, so it's not one of those; the problem is within the client itself, or some change it makes without the option not to. So let's see if I can narrow down specifically what's causing it.

  • This Forum suggests stopping "Net Driver HPZ12" service (some sort of HP monitor thing)
    • Did one better and disabled "Pml Driver HPZ12" as well (another HP monitoring thing)
      • This did not help
    • Let's try disabling the service and rebooting.
      • No Dice
  • This post on Experts Exchange says to turn off bidirectional support in the printer properties. Let's try that.
    • It's under the "ports" tab in printer properties. I changed it on the server, then deleted/re-added it to the client.
    • This doesn't appear to have any effect
  • Downloading XenDesktop 7.5 -- just the VDA, I'm not upgrading my whole installation yet. Since this "clean" machine isn't even joined to the XenDesktop Controller, I don't see how that would have any effect anyway.
    • Installed 7.5 VDA, doesn't appear to have had any effect.
    • Just to be sure I removed the device and uninstalled all drivers and tried again
      • Still locks up
  • Disabled Citrix Print Manager Service
    • I've tried this before, but thought I'd try again under 7.5
      • Still locks up
  • Disabled "Citrix Personal vDisk" Service
    • At this point I'm just disabling Citrix Services one-by-one to see if there's any change
    • Still locks up
  • Disabled "Citrix Profile Management" Service
    • Still Locks up
  • To reduce redundancy, disabled each "Citrix Service" one-by-one
    • Still Locks up
  • Tried disabling "allow direct connection to printers" in Citrix Policy.
    • This made it slightly less terrible. It still hangs for a bit while "connecting to printer" but the program doesn't stop responding (or at least, windows doesn't think it has). Not a perfect solution but it's progress at least.
  • Found this forum post, trying the solution listed at the bottom - installing the VDA without the universal printing component
    • Completely uninstalled current VDA first
      • Verified problem had gone away
    • Reinstalled using "XenDesktopVdaSetup.exe /components vda /EXCLUDE "Citrix Universal Print Client" /logpath "c:\ctxinstall.log" /quiet /noreboot"
      •  Sweet baby Jesus I think that actually worked
      • Yep, that appears to solve the problem
I've applied this fix to my main two XenDesktop terminal servers (aka ServerOS hosted desktops) and it appears to have solved the problem. You no longer get the "connecting to printer..." dialog, the program doesn't go into a not-responding state, and users will hopefully stop breaking things by clicking a bunch of buttons while it appears to be frozen.

If the forum post I linked above is to be believed, the issue is that the XenDesktop client attempts to talk to the Universal Print Server even when the UPD is not being used. I'm a little skeptical that this is the entire problem, because the error would happen even when using the UPD/UPS. But at any rate, it's fixed. Obviously this precludes using the Citrix Universal Print Server in the future, but it's honestly been such a pain to manage/get working that I have to call that a 100% positive effect.







    Friday, August 29, 2014

    IMacros for Firefox Failure Code 0x80500001 (Error code: -1001)

    Solution

    The root issue is the encoding of the files. I've had this problem before with iMacros, but it's never been quite this specific. Usually saving the datasource (the CSV file) as UTF-8 works, but some update to Firefox or iMacros has made it really inflexible. Both the datasource (.csv) AND the macro file (.iim) must be saved as "UTF-8 with BOM". I used Sublime Text to do this, but any full-featured text editor should work.
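
    Since this is otherwise a PowerShell blog: in Windows PowerShell, Out-File -Encoding UTF8 writes a BOM, so re-encoding both files can be scripted (a sketch; the filenames are placeholders):

        Get-Content .\datasource.csv | Out-File .\datasource-bom.csv -Encoding UTF8
        Get-Content .\macro.iim | Out-File .\macro-bom.iim -Encoding UTF8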

    For the record, various versions of things I use.
    Firefox : 29.0.1
    iMacros : 8.8.2
    Windows : 8.1 x64

    Problem/Full Story

    I often have to fill out web forms over and over again to perform certain tasks. A lot of these web forms are poorly designed at best, and don't support batch-type inputs. So having a program like iMacros is essential for me not wanting to kill myself while filling out a DHCP registration form 200 times.

    I've used iMacros for a number of years and never had too many problems -- in Firefox at least; in Chrome the sandboxing makes it an exercise in keyboard-snapping frustration to read/write files, but that's another story. However, I needed to do a bunch of the previously mentioned DHCP registrations today (this is a system managed by another department, and the web interface is the only way to do it besides submitting a work request, which can take days), and found that the macro/CSV I had previously used were not working. I received the following error message:

    Error: Component returned failure code: 0x80500001 [nsIConverterInputStream.init], line 4 (Error code: -1001)
    I'd actually run into this error before, or at least one similar to it. iMacros (or possibly Firefox) can be rather picky about the encoding it uses. Previously, saving the .csv I use for inputs as UTF-8 had solved the problem. Today that didn't work, though.

    After fiddling around with it for a bit, I found something strange. I created a new macro (.iim) file to see if the other one was corrupt or something, but writing/saving it through Sublime Text (not the built-in iMacros editor) as UTF-8, then opening it in the iMacros editor, just showed a blank file. Strange. After trying a handful of different encodings for the macro file, I found one that it would recognize: "UTF-8 with BOM". After saving the file with this encoding through Sublime, it would show up correctly in iMacros. However, I was still getting the same error when I tried to run it. I saved the CSV file with the same "UTF-8 with BOM" encoding, and then it ran.

    Thursday, August 28, 2014

    Citrix Receiver for Mac "Cannot start the desktop ... OSStatus -1712"

    Solution

    In my case there were non-responsive processes on the Mac client that were causing the problem. To resolve it, I closed out of Receiver and closed any active desktop connection. I then brought up Activity Monitor (command+space to bring up the search, enter "activity monitor"). There were several Citrix processes: one non-responsive process with the name of the personal desktop that wouldn't load, and a few helper processes. I force-quit all Citrix processes, then restarted the Receiver client. It connected to the desktop successfully at that point.

    It may not have been necessary to force-quit all Citrix processes, but it doesn't seem to have had any consequences, they started back up when I reloaded receiver.

    Problem / Full Story

    Had a user this morning who couldn't connect to their Windows desktop over XenDesktop (7.1). The user is one of our few Mac (running Mavericks) users, and uses XenDesktop to get to the Windows applications he needs. When he tried to log in this morning, he got the following error when he tried to connect to his Windows 7 machine.

    Cannot start the desktop "Personal Desktop"
     Contact your help desk with this information: The application "Personal Desktop" could not be launched because a miscellaneous error occurred. (OSStatus -1712).
    The odd thing was, he was able to connect to his Windows 8 desktop just fine. So the connection to the server was working, as was the connection to at least one VM. The Win7 machine was showing up as registered and ready in Citrix Studio on the XenDesktop Controller, and Win7 appeared to be responsive when interacting with it through XenCenter. I tried restarting the Windows 7 machine, but the error persisted. A brief look through the logs on the Win7 machine and the XDC didn't show any errors, so it seemed like the problem wasn't server-side. Had the user log out of and close Receiver on his machine and reopen it, but the error continued to occur.

    Up in the user's office I brought up Activity Monitor and saw the unresponsive process -- see "Solution" above. After killing and restarting all Citrix processes, the user was back up and running. Rebooting the Mac probably would have had a similar effect.
     

    Tuesday, August 12, 2014

    a security package specific error occurred - Security-Kerberos EventID 4

    Solution

    Root problem was that there were static DNS entries set for some computers whose IP addresses had changed. Deleting static entries and waiting for changes to propagate out solved the problem.

    Full Story

    Had an issue this morning where some new computers on our network were not getting printers mapped. This is not an uncommon occurrence, because printers, but the cause of the problem was a new one for me. These computers had just been upgraded (new hardware, same hostnames) and seemed to be functioning fine on the domain. The print driver was working fine on other machines, and the usual fix, restarting the print spooler, had no effect.

    Trying to access the Event Viewer on the lab machines I got the error "A Security Package Specific Error Occurred". This error (or a variation) came up trying to access the computer via any WMI / RPC / DCOM method.

    On the print server I had the following error, listed as Level:Error, Source:Security-Kerberos, Event-ID: 4

    The Kerberos client received a KRB_AP_ERR_MODIFIED error from the server MYLAB-04$. The target name used was cifs/MYLAB-02.My.Domain.Com. This indicates that the target server failed to decrypt the ticket provided by the client. This can occur when the target server principal name (SPN) is registered on an account other than the account the target service is using. Please ensure that the target SPN is registered on, and only registered on, the account used by the server. This error can also happen when the target service is using a different password for the target service account than what the Kerberos Key Distribution Center (KDC) has for the target service account. Please ensure that the service on the server and the KDC are both updated to use the current password. If the server name is not fully qualified, and the target domain (MY.Domain.Com) is different from the client domain (My.Domain.Com), check if there are identically named server accounts in these two domains, or use the fully-qualified name to identify the server.

    One thing jumped out right away: the error is from lab computer 04 (SPN: MYLAB-04$), but the FQDN is listed as computer 02 (cifs/MYLAB-02.My.Domain.Com). That set off some alarm bells, but I still did some additional research before jumping in.

    Supposedly this error can be caused by a number of things (a Google of "A Security Package Specific Error Occurred" returns about six different causes on the first page of results). In my case, as mentioned above, it was a DNS issue. While upgrading these lab machines, the IP addresses we assigned through DHCP changed slightly. Normally we just let the machines register themselves with the DNS server after they pick up their IP via DHCP; we don't have many static DNS entries. For some reason, these machines had static entries, though, so our DNS server was resolving their hostnames differently than AD was, which is what caused the authentication errors. Deleting the static entries and waiting (DNS changes can take a while to replicate) solved the problem.
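
    For reference, static records can be listed and removed on the DNS server with dnscmd; the first command shows what's registered for a node, the second deletes the stale A record (the zone and host names here are the placeholders from the event text above):

        dnscmd /enumrecords My.Domain.Com MYLAB-02
        dnscmd /recorddelete My.Domain.Com MYLAB-02 A /f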

    Thursday, July 24, 2014

    Samsung Galaxy Light - WiFi calling "called ended", WiFi messaging fails

    Resolution

    Update 9/24/14

    Been using it for a bit now and haven't had many problems. I still have an occasional call that goes straight to voicemail or a text that won't send until you reboot, but the problem is much less pronounced than before, and doesn't have the same repeatability it had before.

    Writing this now because I just got a new firmware upgrade today. We'll see if it has any effect on the situation. That page says that it only enables "in flight texting", but it wouldn't surprise me if there were additional bugfixes included.

    --------------------------------------- Original Text -----------------------------------------------------


    In contrast to most of my posts, the solution to this is actually available pretty readily on the internet. I am posting this to confirm that it worked on my two phones, because people are pretty bad at updating forum posts once they've found a solution.

    The fix in this case is to run a firmware update via the "Samsung Kies" software. This firmware update is not available OTA. Testing over a ~24 hour period I have been unable to reproduce the problem, which was quite easily reproduced before the update.

    While researching this problem, I found many similar reports of other Galaxy phones (s3,s4,etc) having the same problem. None of those reports seemed to have a resolution, only the Galaxy Light thread had anyone confirm that they had fixed it. So, if you have another Galaxy phone with the same problem, I would check to see if there is a firmware update available via the Kies software.

    Installing and running Kies software

    1. Download the Kies software for your platform from Samsung's website
      1. There are two versions "Kies" and "Kies 3". For the Galaxy Light, you want "Kies", newer phones may need the "Kies 3"
      2. There is also a version for Mac, only one version though, not sure if that works for all phones or what.
    2. Install the Kies software.
      1. There isn't much to do here. Just click next a whole bunch, really. Though I did opt to install the "universal driver tool"; not sure if that's necessary.
    3. Open the Kies software, you should see a "connect your device" type prompt. Connect your device.
    4. If this is the first time you've plugged in your phone to the computer it may take a few minutes to install drivers.
      1. Note: I did have one of my phones lock up while connecting it to the computer (screen and hardware buttons became unresponsive). A Force reboot fixed it.
      2. The device needs to be in MTP mode, not PTP mode. Kies will warn you if it's not. 
    5. When I hooked up my phones, it immediately prompted me to do the firmware upgrade.  If this doesn't happen for you, the "Basic Information" tab should show the current firmware and whether or not it's up to date.
    6. Follow Kies instructions to upgrade the firmware. 
      1. The first time I ran this, I let the phone go to sleep while Kies downloaded the firmware update. Since the download took a while (I don't have super fast internet at home) the phone disconnected from Kies and I had to start the process all over again. Had to sit there swiping the screen back and forth while the download happened to keep it from sleeping/disconnecting.
    7. Phone will reboot and install, don't turn it off or do anything to it while this happens.
      1. The upgrade performed just fine for me on both phones, with no loss of data. Still, if you have critical stuff on your phone (why do you have critical stuff on your phone?, keep that stuff somewhere less steal-able) maybe back up the phone (can be done with Kies (Backup/Restore tab), or about 1000 other things) before doing the upgrade, just in case.
    That's about it. Once the firmware update is complete you shouldn't have any more problems with WiFi Calls not dialing, not receiving calls on WiFi, or not being able to send/receive text messages on WiFi.

    In the interest of full documentation: one call I made right after the upgrade had really poor call quality (it sounded like I was underwater). This was one call, and I have not had the problem recur since.

    The Problem / Full Story

    We recently switched to T-Mobile because they're reasonably cheap (for what you get) and their business model is slightly less troublesome than most big carriers'. <Rant>It took about 8 hours on the phone over 2 days to get the plan set up properly, because the original guy who sold us the plan didn't know what he was doing.</Rant> Anyway, we brought our own devices, because you can pick up the Galaxy Light on Amazon for dirt cheap. It's not a high-end phone, but it has reasonable specs and a fairly recent version of Android (4.2.2).

    But enough advertising; you wouldn't be here if you weren't having a problem. The phones worked fine for a few days, but we started having problems with the WiFi calling within about a week. WiFi calling was a big deal for us because all carriers (besides Verizon, but fuck them) have not-so-great coverage in my town, but we have WiFi pretty much everywhere we go. So problems with WiFi calling were problems with our service in general.

    The problems were as follows. If the phone was allowed to sleep for a while (30-45 minutes on average, it seemed), WiFi calling would stop working. This meant that, without a cell signal (which I don't get at work, because my building is made out of concrete and fluorescent lighting), no calls could be made or received and no texts could be sent or received. Not that the phone was aware of this; the WiFi calling icon in the notification bar was still blue and claimed calls were being made over WiFi. When you actually tried to place a call, however, it would immediately end the call (at 0:00 seconds) with the status "Call Ended", and the call history would show "Canceled". Trying to send texts would result in a "Failed to Send" message. People trying to call us would get sent straight to voicemail, or occasionally one ring and then voicemail. The only way to restore service was to toggle WiFi calling off and on (usually) or reboot the phone. The fix would only last until the phone went back into its low-power sleep state (roughly 30 minutes of the screen being off).

    After much internet searching and playing around with settings on the phone, I came across this thread (also linked at top) about WiFi calling on the Galaxy Light. A non-OTA firmware upgrade was available, and it fixed the problem (see above for steps).

    Here are some things I tried that did not work:
    1. Clearing data (via application manager) from "WfcService" and "Wi-Fi Calling Settings"
    2. Changing Wi-Fi calling preferences (prefer cell, etc.)
    3. Different Wi-Fi networks
    4. Standing close to the router (the router is on my desk at home, so while sitting at my desk the phone is <5 feet from it)
    5. Turning off voice control
    6. Turning off other wireless radios (bluetooth, gps, NFC, etc.) 
    Here are some other things I've seen reports of people trying that did not work:
    1. New phone - This appears to be a problem across Galaxy phones generally. I've seen reports of people getting phones replaced 4 or more times without resolution.
    2. Factory Reset
    3. Changing network mode (LTE/WCDMA/GSM, WCDMA/GSM, WCDMA only, GSM only)
    4. Changing/upgrading SIM cards
    5. Opening ports on the router


    Monday, July 21, 2014

    Condusiv V-Locity - setup and first thoughts

    Introduction

    I'm going to be pretty brief here because I feel I'm not going to have much to actually say about this piece of technology (note from the future: I wasn't able to get this to work in my environment; read on if you're interested in the problems I ran into, but otherwise this article probably isn't worth your time). We'll leave it at: managing storage IO in a virtualized environment is a pain, so I've taken to investigating some technologies that aim to improve storage performance without simply buying more storage devices. This post is written in stream-of-consciousness style as I go through the setup process. I try to document anything I notice and/or am thinking during the install. I do some minimal editing afterwards, but for the most part it'll be a rambling mess.

    V-locity is a program from Condusiv, a name that was obviously dreamed up by someone with no respect for spoken language, ease of typing, or autocorrect. From here on I'll probably just refer to it as "the program." The idea behind the program is that the Windows file system driver is poorly optimized for an age of virtualization and non-local storage. Breaking file reads/writes into multiple chunks isn't noticeable on local storage, but can add serious overhead when it has to go over iSCSI. So, through a new driver and a bunch of caching, the program hopes to optimize storage to give you better density without buying more hardware (or increasing CapEx, as they say) </marketing>. I won't go into much detail about how it works here (I'm still a little fuzzy after a webinar and something like 6 sales calls, and let me tell you it's not for lack of paying attention); if you're interested you can read all about it here.

    First let me say, Condusiv certainly isn't trying to save you any money over buying more hardware. We were quoted a price of around sixty thousand dollars (plus about seven thousand in yearly licensing) to run on our three servers (32 cores each, which is how it's sold). That's insane; that's roughly three times as much as the servers themselves cost. This more or less makes it an option only if you're out of rack space, or for whatever reason can't move your data to faster storage devices.

    Setup

    I've got a test environment set up: 50 VMs and a server. The VMs are running on some Dell R610 servers connecting to their storage over a 6Gbps direct-attach SAS link. The servers are running XenServer 6.2.0 (SP1 + all other patches). The VMs are 64-bit Windows 7, all updates installed, with basic office applications for testing. Tests will use XenDesktop to measure login performance (connecting via thin clients) and a more manual approach to measure some application launches (Visual Studio 2012 in particular is one we've had take a really long time to load on VMs, due to excessive file system access during first-run).

    Setup is broken into three parts: the VMC (controller), the master node (V-locity server), and the clients. Since this is a test setup, my VMC and master node are living on the same server. VMC setup is simple; just click next and it installs itself and a webserver to interface from. One thing: it doesn't tell you that you access it via the web page. The installer just finishes and you have to figure that out. The setup instructions don't really say this directly either; you just have to kind of guess at it (I figured it out because the install directory had a bunch of .js, .html, etc. files).

    After that, the setup runs a discovery on your domain to find machines to install on. I didn't set any sort of filters on this, but it is currently stuck (about 20 minutes) on 740/742, we'll see if it ever finishes.

    ... 30 minute mark now, still spinning on 740/742.
    ... well over an hour now. Neither the "close" nor "next" buttons do anything.
    ... two hours and no sign of movement. I'm about to go home, so I'll let it run overnight and reboot the dumb thing if it hasn't sorted itself out by morning.
    ...
    ...
    ...Still at the same spot; I think it's safe to say it's stuck. Going to try restarting the service. Now it says discovery complete, 1 record processed. Sounds legit. Looking through the machine list, it seems to have detected a fair number of my machines, but none of the VMs I created specifically for testing.

    After another restart of the VMC service and some time, it picked up all my servers, but I've run into a bigger issue: the master node component won't install on my virtual server. The server meets all of the requirements listed in the various install guides and readme files, but it doesn't show up on the list of machines available for deployment. Trying to run the installer manually gives the error "OS not supported".

    Looking further, it is only presenting the option to install the master node component to physical machines. I can't find this listed as a requirement anywhere, and the sales rep/tech people say that it isn't a requirement, but that's the only option it's giving me. 

    Worked briefly with the sales rep/tech support team that's been helping me; they gave me some new licenses to try, but for whatever reason the program still only gives me the option to install to physical servers. I don't have spare physical servers lying around, so we're a bit dead in the water.

    On a hunch I looked up V-locity + XenServer (my hypervisor of choice), and found some conflicting reports of support for the XenServer platform (PDF). At best it has partial support, and possibly only for the guest/client. So maybe that's the issue. Looking back through my emails, I definitely mentioned that's what I was running on (and I'm pretty sure we covered it more in-depth during one of the 7-8 phone calls they made me sit through), but maybe I wasn't clear enough.

    So, unfortunately, this is where my review of V-locity must end. I'd spend more time with their tech support troubleshooting it, but I have other projects that need my attention. So, take my experiences for what they're worth (probably not much), but if you're looking to evaluate this and are using XenServer, maybe be sure you're clear with your reps about the setup.

    Edit: The sales rep assures me that V-locity works "with Citrix" (I haven't gotten him to say "with XenServer"), and in the interest of objectivity, I was able to get it to start seeing virtual machines. It still doesn't see the test server I built as a valid install location, so I'm still stuck, but there you go.

    Thursday, July 10, 2014

    Excel Crash: Visual Studio (10) Tools for Office Add-in -- vs10exceladaptor

    Solution

    The solution thus far has just been to disable the add-in for all users. We don't know of anyone actively using this add-in, so that works for us. If you need the add-in, I would look towards compatibility. 0xC0000005 typically indicates that a program tried to access memory it's not allowed to. This could mean another plug-in isn't playing nice, or you might try disabling DEP (though this is a pain for Office, and more than a bit of a security risk).

    To disable the add-in for all users, the best way I found was to log in as admin, find the Excel executable (excel.exe) > right click > Run as administrator. Then go to File > Options > Add-ins > COM Add-ins > Go. Then uncheck the box(es) for "Visual Studio Tools for Office Design-Time Adaptor for Excel".

    Story

    Had some users complaining about Excel crashing on our terminal server. This is a terminal (RDS) server that students use to remotely access lab applications via thin clients, so it has just about every program under the sun installed on it. I mention this only because it means we have about 1000 different Excel add-ins loading/available, which is what I expect is causing the underlying issue. Also worth noting: thin clients connect via XenDesktop (7.1); this could also be a cause of the error.



    Other notes on server: Server 2008R2 (fully updated, x64), Office 2013 x86

    Looking at the event logs, I see the Excel crash (Error, Application Error, Event ID: 1000):

    Faulting application name: EXCEL.EXE, version: 15.0.4535.1507, time stamp: 0x52282875
    Faulting module name: EXCEL.EXE, version: 15.0.4535.1507, time stamp: 0x52282875
    Exception code: 0xc0000005
    Fault offset: 0x0005a802
    Faulting process id: 0x2380
    Faulting application start time: 0x01cf9c61a803a93c
    Faulting application path: C:\Program Files (x86)\Microsoft Office\Office15\EXCEL.EXE
    Faulting module path: C:\Program Files (x86)\Microsoft Office\Office15\EXCEL.EXE
    Report Id: ed04ac54-0854-11e4-9867-d4bed9f3434f
    Which doesn't give us much. In past experience, 0xC0000005 is a generic "memory access violation" error -- a program tried to access memory it didn't have permission to. The next entry in the event log is a bit more useful (Error, Microsoft Office 15, Event ID 2001):

    Microsoft Excel: Rejected Safe Mode action : Excel is running into problems with the 'vs10exceladaptor' add-in. If this keeps happening, disable this add-in and check for available updates. Do you want to disable it now?.
    This appears to be something that gets installed with Visual Studio; no idea what it does. I went ahead and disabled it for all users (see notes in Solution) since I'm not aware of anyone using that add-in. Worth noting that I initially tried disabling the add-in through the registry (HKLM\Software\Microsoft\Office\Excel\Addins\VS10ExcelAdaptor\ -- set LoadBehavior to 0), but that didn't seem to have any effect.


    Thursday, June 19, 2014

    Creating A Ceph Storage Cluster using old desktop computers : Part 2

    So, in the last part I left off where I had an active+clean cluster with two OSDs (storage locations). No data has yet been created and, indeed, no methods of making the locations available to store data have been set up.

    Following along with the quick start guide, the next thing to do is to expand the cluster by adding more OSDs and monitoring daemons (I'm about half-way down, under "expanding your cluster"). So away we go.

    Expanding Cluster

    Adding OSDs

    Adding a third OSD went just fine using:

    ceph-deploy osd prepare Node3:/ceph
    ceph-deploy osd activate --fs-type btrfs Node3:/ceph
    # For those of you just joining us, I'm using btrfs because I can. The recommendation is typically xfs or ext4, since btrfs is experimental.

    After running those commands, "ceph -s" shows the cluster now has "3 up, 3 in" and is "active+clean". Available storage space has also increased significantly.

    Adding a Metadata Server

    The next step is to add a metadata server, which is used by CephFS. CephFS is one option for presenting the Ceph cluster as a storage device to clients. There's not much to be said here; I ran the command and it completed.

    ceph-deploy mds create Node3
    # I chose Node3 arbitrarily 


     Adding More Monitors

    So now we set up more monitors, so that if one monitor goes down the entire cluster doesn't die. In the previous part, I ran into an issue where the monitor service started creating very, very verbose logs, to the extent that it filled up my OS partition (several MB of logs a second). I was able to fix this with a change to the ceph.conf file, so I'm hoping that change gets carried between monitors, but I guess we'll see.
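
    For reference, the kind of ceph.conf change that quiets monitor logging looks something like this (the exact options and levels here are illustrative, not the exact values I used):

    [mon]
    # Lower the on-disk log verbosity (first number); the in-memory level
    # (second number) stays higher so crash dumps remain useful
    debug mon = 1/5
    debug paxos = 1/5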

    ceph-deploy mon create Ceph-Admin Node2

    This didn't go as well. It installs the monitor on each node, but the monitor process does not start and does not join the cluster. Some errors during install:

    • No data was received after 7 seconds, disconnecting
    • admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
    • failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i Node2 --pid-file  /var/run/ceph/mon.Node2.pid -c /etc/ceph/ceph.conf --cluster ceph '
    • Node2 is not defined in 'mon initial members'
    • monitor Node2 does not exist in monmap
    • neither public_addr nor public_network keys are defined for monitors
    • monitors may not be able to form quorum
     I found a very helpful blog post detailing resolutions to many of these errors.

    First problem: my admin/deploy box had a bunch of hung ceph-create-keys processes, so I killed all those.
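
    If you hit the same thing, clearing them out is simple enough (illustrative commands, not an exact transcript):

    ps aux | grep [c]eph-create-keys   # check for hung create-keys processes
    sudo pkill -f ceph-create-keys     # kill them all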

    I rebooted the new monitor node, and the mon service started, but I can't seem to interact with the cluster at all now. That's probably not a good sign. All ceph commands time out, even running on Node1.

    ....

    After much troubleshooting that went nowhere, I'm rebuilding the cluster: uninstalling everything and purging all data. The reinstall is going pretty quickly now that I know how everything works (ish). One thing I did find: I'm a bit clearer now on the difference between

    ceph-deploy new
    ceph-deploy mon create-initial 

    Now. "new" actually creates the 'new' cluster. You have to specify monitor nodes though so I thought 'new' referred to new monitors. Anyway, I following all the previous steps again, trying not to take any shortcuts or anything so hopefully I'll end up right back at the point before I screwed everything up.
     
    ...

    New problem: "ceph-deploy osd activate" fails, saying that it couldn't find a matching fsid. A bit of Googling suggests that there is data left over from the first install (despite doing the purge+purgedata while I was remaking the cluster), so I'm reformatting the drive to see if that works.

    Yep, deleting and recreating the partition and reformatting worked. So purgedata does not, apparently, actually purge data, at least not on my systems. Note: deleting and recreating the partition changes the filesystem's UUID, so /etc/fstab needed editing again to make sure the mount worked correctly.
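
    Roughly, the post-reformat cleanup looks like this (the device name and mount point here are examples):

    sudo blkid /dev/sdb1   # grab the new filesystem UUID after reformatting
    # then update the matching /etc/fstab entry, e.g.:
    # UUID=<new-uuid>  /ceph  btrfs  defaults  0 0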

    ...

    Back to the pre-expansion state with pools at "active+clean", and another discovery: the quick start guide tells you to add "osd pool default size = 2" to the ceph.conf file under "[default]". This is a lie; it goes under "[global]". That is why I had to go back and set the size on each pool last time in order to get to the "active+clean" state.
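
    That is, ceph.conf should read:

    [global]
    osd pool default size = 2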

    ...

    and the add-monitors step gave the same errors:

    • Starting Ceph mon.Node2 on Node2
    • failed: 'ulimit -n 32768; /usr/bin/ceph-mon -i Node2 --pidfile /var/run/ceph/mon.Node2.pid -c /etc/ceph/ceph.conf --cluster ceph '
    • Starting ceph-create-keys on Node2
    • No data was received after 7 seconds, disconnecting
    The monitors do not start on Nodes 2 or 3. I'm not going to try rebooting them this time, in the hope that it doesn't totally destroy my install again.

    ....

    Broke it again, this time by trying to use "ceph mon add <host> <ip:port>" to manually add the monitor so that it would stop saying it wasn't in the monmap. This, apparently, is not the way to do that.

    Guess I have to reinstall everything again... joy

    ....

    Broke it a few more times, but everything is working now with 3 monitors. For whatever reason, using ceph-deploy to add the second/third monitor was not working at all, so I used this guide to manually add the monitors. That worked, except steps 6 and 7 are backwards: the monitor needs to be started before you run the "ceph mon add" command. "ceph mon add <etc>" will hang if the monitor you tell it to add does not respond, and if you kill (ctrl+c) the "ceph mon add" command, that's when the whole cluster becomes unresponsive. You can technically run "ceph mon add" and then start the monitor on the node, but since "ceph mon add" takes over your shell, getting to the node to start the monitor can be problematic.
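
    So the working order is roughly this (the ceph-mon flags mirror the ones from the install errors above; the IP is an example):

    # On the new monitor node: start the monitor daemon first
    sudo /usr/bin/ceph-mon -i Node2 --pid-file /var/run/ceph/mon.Node2.pid -c /etc/ceph/ceph.conf --cluster ceph
    # Then, from a node that can still talk to the cluster, register it in the monmap
    ceph mon add Node2 192.168.1.12:6789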

    So now I've got a cluster with 3 OSDs and 3 monitors. I've got a warning about the clocks being different between the monitors, but other than that it's working and happy. Manually set all the clocks, but the clock skew warning is still happening. Restarted one node and one warning went away. Trying to restart another node, but creating new monitors the way I did means they didn't get put in /etc/init.d, so I can't restart them via the "service" command. Trying to find out how to add them to this.

    Giving up on that for now, may come back to it later.

    Finally Using Ceph


    Ok, while it's not exactly in prime condition, I want to get down to the functionality, so the clock skew (it's a couple of seconds) and the whole daemons-not-being-in-init.d problem I'm leaving for later.

    Going to use the laptop I set up as a proxy as the client, which means I need to update its kernel.

    ...

    Laptop kernel updated; now using this guide to get the client and block device set up.

    Set up ssh keys, hosts file, ceph user + sudo access, and the ceph repo.

    More problems installing ceph

    ceph-deploy install ceph-client -- for some reason installing this on the laptop has been much more difficult than on my other machines. Maybe because the laptop wasn't installed with the minimal install? Here are a few things I've run into.

    Repo errors - I set up the ceph repo according to the pre-flight check guide, but kept getting 404 errors during the install. Looking at the ceph.repo file, ceph-deploy apparently adds additional repos on top of the ones set up via the guide; removing these and running ceph-deploy with the --no-adjust-repos flag fixed that. I don't know why ceph-deploy was adding bad repo URLs.
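
    In other words:

    ceph-deploy install --no-adjust-repos ceph-client   # use the repos as configured instead of rewriting them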

    Python error after install - After ceph installs, the installer tries to run "ceph --version" to verify the install, but this failed with a Python traceback saying it couldn't find the module argparse. I ended up having to install argparse and setuptools manually. It's strange; I didn't have to do this on any of the osd/mon/admin machines, and they're running the same OS, same version of Python, and same steps to install ceph. Not sure why the client was such a jerk about it. The only other difference with the client is that it's 32-bit.
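
    For the record, the manual fix was something along these lines (treat it as a sketch of what works on EL6, not an exact transcript):

    sudo yum install -y python-setuptools   # provides easy_install
    sudo easy_install argparse              # argparse isn't in Python 2.6's standard library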

    "ceph-deploy admin ceph-client" ran fine

    Back to using Ceph

    Well, with the errors in getting the client set up fixed, back to trying to set up a block device.

    "rbd create TestBD --size 10000" Should create a ~10GB block device, runs fine.
    "sudo rbd map TestBD --pool rbd --name client.admin" - should map it, does not run fine; get the following errors
    • ERROR: modinfo: could not find module rbd
    • FATAL: Module rbd not found.
    • rbd: modprobe rbd failed! (256)

    What does this mean? Not a clue.

    ...

    Looking through various mail archives and other blog posts, it seems clear I'm missing a certain driver for rbd (RADOS block device). Certain posts suggest that I install ceph-common to get this driver, but "ceph-common" is not a package in the EL6 repo -- apparently I should have done this on Ubuntu, which seems to be what most of this is written for.

    So looking at the ceph packages I do have available (assuming the driver is in one of them, which it may not be), I can install: "ceph-deploy", "ceph-devel", "ceph-fuse", "ceph-libcephfs", "ceph-radosgw". The descriptions of these from "yum search ceph" aren't much help. I'm going to try devel and libcephfs first; those sound promising.

     ...

     Nope, no help. A yum search for rbd also returns nothing useful.

    ...

    Evidently this is a kernel driver that I didn't compile into my kernel... So that's fun...

    Recompiling my Client Kernel again

    So, I'm not going to bother updating the kernel on the osd/mon machines, just the client; I don't think the others need it. And in fact, there are a lot of warnings about not using rbd on OSD devices. Whether this means you shouldn't actively use rbd on them, or that it is dangerous to have the driver installed at all, isn't clear.

    So, I go back to the extracted kernel and run "make menuconfig". Under "Device Drivers > Block devices" I find "Rados block device (RBD)". I'm not entirely sure whether I should build this in or modularize it, mostly because I'm not sure what the difference between the two is. To Google!... Seems to be the difference between building it into the base kernel (loaded at boot, with no ability to remove it) vs. loading it after boot via modprobe. I think I'll modularize it, since that seems to be what ceph is expecting based on the errors.

    So now it looks like "<M> Rados block device (RBD)". Time to compile the kernel again, weeeeeee...
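
    (If you'd rather check the config file than the menus, the module setting shows up in .config as:)

    CONFIG_BLK_DEV_RBD=m   # build the rados block device driver as a loadable module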

    ....

    Kernel recompiled, rebooted, tried the "rbd map" command again aaaaaaand... crashed. Sweet. I won't reproduce the entire stack trace here, but the word [libceph] is mentioned over and over.

    One possibility, found in this email archive, is that the 3.6.11 kernel is too old. Because you know, THEIR FREAKING OS RECOMMENDATIONS PAGE DOESN'T SAY USE THE LATEST IN 3.6.X OR ANYTHING. Not that I'm bitter.

    ....

    So I downloaded and compiled the latest kernel (3.15.1 at time of writing), but had some issues. Notably, my network devices did not come up; the compile had issues finding a bunch of modules, so I'm assuming that was the cause. Debating between trying to fix the 3.15 kernel or going to a slightly older one and seeing if that works.

    Tried 3.12.22, same problem

    ....

    So this is probably my inexperience with upgrading/compiling my own kernel showing. Apparently you copy the default CentOS config from the /boot directory to the unpacked kernel directory, rename it to .config, and then use menuconfig to add in the things you want. This means any configurations in the current kernel are carried over. Somehow this happened automatically when I upgraded from 2.6 to 3.6, but it isn't happening now.

    • make clean # Clean up the failed make
    • cp /boot/config-2.6.32-431.17.1.el6.i686 /tmp/linux-3.12.22/
    • # May have forgotten to mention, the client is 32-bit because the laptop is super old
    • cd /tmp/linux-3.12.22
    • mv config-2.6.32-431.17.1.el6.i686 .config
    • make menuconfig
    • # Enable the rbd driver as a module (Device Drivers > Block devices)
    • make
    • sudo make modules_install install

    Doing it this way, there are only a few "could not find module" errors (aes_generic and mperf, to be specific). I'm not sure what they are, but hopefully they're not too important.

    Booted into 3.12.22, and my network is working now; this is good. Let us see if I can finally map the rbd device.


    Sweet baby Jesus I think it worked.

    • sudo rbd map TestBD
    • sudo mkfs.ext4 /dev/rbd1
      # It didn't tell me this is what it mapped the image to; I just had to look for it (see note below)
    • sudo mkdir /CephBlock1
    • sudo mount /dev/rbd1 /CephBlock1
    • cd /CephBlock1
    • sudo touch IMadeaBlockDevice.txt
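
    (For what it's worth, rbd has a standard command that shows where images got mapped, so you don't have to hunt for the device:)

    sudo rbd showmapped   # lists image, pool, and the /dev/rbdX device it's mapped to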

    Back to Using Ceph .... Again.


    Yep, appears to be working. Time to test some throughput; just going to do a dd with various block sizes to test the write speed of the drive. Command used:

    sudo dd if=/dev/zero of=/CephBlock1/ddspeedtest.txt bs=X count=Y oflag=direct

    Vary X and Y to keep the amount of data transferred mostly consistent; oflag=direct should keep it from buffering the writes, giving a better idea of actual drive performance. Also, the laptop (despite being old) and all the nodes have Gigabit Ethernet cards connected to a gigabit switch, so the network shouldn't be a bottleneck.
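
    For example, the first row of the table below came from a command like:

    sudo dd if=/dev/zero of=/CephBlock1/ddspeedtest.txt bs=4K count=10000 oflag=direct   # ~41MB in 4K blocks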

    Ceph Block Device Write:

    Speed      Block Size (bs)  Count  Total Data
    76 KB/s    4K               10000  41MB
    614 KB/s   32K              1250   41MB
    1.1 MB/s   64K              625    41MB
    2.1 MB/s   128K             312    41MB
    4.1 MB/s   256K             156    41MB
    6.3 MB/s   512K             78     41MB
    7.3 MB/s   1024K            39     41MB
    7.9 MB/s   2048K            20     42MB
    21.1 MB/s  4096K            10     42MB
    31.0 MB/s  8192K            5      42MB
    41.9 MB/s  16384K           3      50MB


    So there are some numbers; they don't tell us much without a comparison, so let's run this against one of the drives directly rather than through Ceph.

    Not well, it turns out. Like, really not well.

    Direct Drive Write:

    Block Size (KB)  Count  Data (MB)  Speed (MB/s)
    4                10000  41         19.4
    32               1250   41         53.9
    64               625    41         57.4
    128              312    41         58.3
    256              156    41         51.7
    512              78     41         54.9
    1024             39     41         58.3
    2048             20     42         50.7
    4096             10     42         53.5
    8192             5      42         57.7
    16384            3      50         53.8

    Some quick math: that averages to about 20% of the direct speed, with a range of 0.3% to 77%. Running the test a few more times indicates that the non-ceph results are a little more erratic; except for the 4KB test, which is always lower (around 20 MB/s), the others vary back and forth between ~48 and ~61 MB/s with no apparent pattern. So if we average that out excluding the 4KB runs, we're still only looking at maybe 22% on average, assuming this evenly mixed block-size workload. So it's unfortunate that we're looking at such a massive performance hit using the ceph block device; even if we assume large-block-size workloads (which may be a pretty big assumption), a ~30% performance impact is significant.

    To see if performance continued to scale with block size, I ran a test with bs=1G count=5. The result was 25.7 MB/s, so apparently performance peaks and then falls off at some point. For comparison, the same 5GB all-zeros text file wrote at a rate of 57.6 MB/s directly, and transferred (via scp) between two nodes at an average rate of 44.1 MB/s.

    Final Thoughts for this Installment

    So my initial impressions of using Ceph are not good. It's about six-and-a-half pains in the back to get set up, and once it's set up, performance is suboptimal. I'm going to do a few more posts where I play around with the other functionality of Ceph and test out things like CephFS and the Object Gateway (alternatives to using the block device), and management (how to get manually-added daemons into init.d scripts). I'm also looking to test failover and high availability, to see what happens to data if a node or two goes offline. I'd also like to do some more in-depth performance testing in a more real-world environment, but I'll have to think up a way to do that. It'd be cool to find out what the bottleneck is; clearly it's not the network or the HDDs, so could it be processing power, memory, or an inherent bottleneck in the software?

    These will be saved for another time though, as once again this post has run (length- and time-wise) much longer than anticipated. I've also got a demo of Condusiv's V-Locity program coming up soon -- not really a competing product, beyond being about storage/IO -- so I may do a "my experience with" post on that as well, so long as the reps I'm working with give me the OK. Til next time.

    PS: Let me know if there are any flaws in the way I tested the storage here. I know it's not exactly scientific or robust, but as far as I can tell it's not a bad first-impressions type test.