Monday, April 23, 2012

Data Protection Manager 2010 February 2012 Update Crash


Update: Microsoft has been able to reproduce the error and believes it was introduced in the latest February 2012 update.
Update 2: It has since been confirmed as a bug, and a hotfix is on the way.  In the meantime, if you run into this problem, Microsoft Support should be able to fix it for you.  The bug is triggered when many events are written to the log at the same time, such as when a hardware failure causes all active jobs to fail.


We use Microsoft Data Protection Manager (DPM) 2010 to protect our environment, including our Hyper-V failover clusters, which use Cluster Shared Volumes (CSV).  Last Wednesday we installed the February update, and then on Friday one of our McData 4700 4Gb FC switches had a port failure that crashed the entire SAN fabric (including the entirely separate second path, which was in no way tied to the primary path).  During this outage, which should never have happened in the first place, our DPM server corrupted itself.  The DPM server is completely separate from the Hyper-V cluster; the only thing they share is the network, so there is no reason for DPM to have crashed like this.

Now, whenever we attempt to start DPM we get this error in the event viewer (and the DPM service fails to start):

The description for Event ID 999 from source MSDPM cannot be found. Either the component that raises this event is not installed on your local computer or the installation is corrupted. You can install or repair the component on the local computer.


If the event originated on another computer, the display information had to be saved with the event.


The following information was included with the event: 


An unexpected error caused a failure for process 'msdpm'.  Restart the DPM process 'msdpm'.


Problem Details:
<FatalServiceError><__System><ID>19</ID><Seq>0</Seq><TimeCreated>4/23/2012 6:07:09 PM</TimeCreated><Source>DpmThreadPool.cs</Source><Line>163</Line><HasError>True</HasError></__System><ExceptionType>NullReferenceException</ExceptionType><ExceptionMessage>Object reference not set to an instance of an object.</ExceptionMessage><ExceptionDetails>System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.Internal.EnterpriseStorage.Dls.PRMCatalog.BackupEventIntegration.WriteBackupEvent(BackupEventEntry backupEventEntry)
   at Microsoft.Internal.EnterpriseStorage.Dls.PRMCatalog.BackupEventIntegration.WriteNonLoggedBackupEntries()
   at Microsoft.Internal.EnterpriseStorage.Dls.Prm.PRMHealthProvider.Initialize()
   at Microsoft.Internal.EnterpriseStorage.Dls.JobManager.JobManager.Initialize()
   at Microsoft.Internal.EnterpriseStorage.Dls.JobManager.JobManager.InitializeIfNecessary(Object state)
   at Microsoft.Internal.EnterpriseStorage.Dls.EngineUICommon.DpmThreadPool.Function(Object state)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading._ThreadPoolWaitCallback.PerformWaitCallbackInternal(_ThreadPoolWaitCallback tpWaitCallBack)
   at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback(Object state)</ExceptionDetails></FatalServiceError>




the message resource is present but the message is not found in the string/message table
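The "Problem Details" payload in that event is well-formed XML, so the useful parts (exception type, reporting source file and line, and the stack frames) can be pulled out with a few lines of script when you have thousands of these to sift through.  The following is only a rough sketch in Python: the function name `summarize_fatal_error` is made up for illustration, the sample string is a shortened copy of the event above, and it assumes other DPM fatal errors use the same element names, which I have not verified.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the "Problem Details" payload from the event above;
# the stack trace is trimmed to its first two frames for brevity.
PROBLEM_DETAILS = """<FatalServiceError><__System><ID>19</ID><Seq>0</Seq><TimeCreated>4/23/2012 6:07:09 PM</TimeCreated><Source>DpmThreadPool.cs</Source><Line>163</Line><HasError>True</HasError></__System><ExceptionType>NullReferenceException</ExceptionType><ExceptionMessage>Object reference not set to an instance of an object.</ExceptionMessage><ExceptionDetails>System.NullReferenceException: Object reference not set to an instance of an object.
   at Microsoft.Internal.EnterpriseStorage.Dls.PRMCatalog.BackupEventIntegration.WriteBackupEvent(BackupEventEntry backupEventEntry)
   at Microsoft.Internal.EnterpriseStorage.Dls.PRMCatalog.BackupEventIntegration.WriteNonLoggedBackupEntries()</ExceptionDetails></FatalServiceError>"""


def summarize_fatal_error(xml_text):
    """Return the exception type, reporting location, and stack frames.

    Hypothetical helper: assumes the element names seen in the event above.
    """
    root = ET.fromstring(xml_text)
    system = root.find("__System")
    details = root.findtext("ExceptionDetails", default="")
    # Stack frames are the lines that start with "at " once indentation is removed.
    frames = [
        line.strip()[3:]
        for line in details.splitlines()
        if line.strip().startswith("at ")
    ]
    return {
        "exception": root.findtext("ExceptionType"),
        "source": system.findtext("Source"),
        "line": int(system.findtext("Line")),
        "frames": frames,
    }


if __name__ == "__main__":
    info = summarize_fatal_error(PROBLEM_DETAILS)
    print(info["exception"], "reported from", info["source"], "line", info["line"])
    print("top frame:", info["frames"][0])
```

In this case the top frame is `BackupEventIntegration.WriteBackupEvent`, which fits the explanation that the bug is triggered while writing backup events to the log.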

I opened a support ticket with Microsoft and within half an hour had a call from a tech.  After an hour of trying different things and digging around, he asked for a copy of all of our logs, but the diagnostic logging tool they gave me wouldn't work properly because DPM had created 110GB of error log files in 3 days... whoops.  Once the repetitive crash logs (9,000 of them, around 12MB each) were deleted, he got the upload and was able to look at the data.

He has since come to the same conclusion I have: the crash was indeed a problem in DPM (no, it should not corrupt itself when the sources it is protecting go offline).  He has sent our (corrupt) DPM SQL DB to the "EE" team to look at, but who knows how long it will take them to figure out what went wrong and create a fix for us.

In order to get back online we uninstalled DPM, reinstalled, upgraded to QFE 2, and restored from the database backup from Wednesday, when the February update was installed.  We then upgraded to the February update again because all of the protection agents needed it, and have set up a task to 

The good news is that you don't need the DPM management console or service to be able to restore data.  In the event of a major SAN failure it would have been possible to recover our data using manual tools (with the help of Microsoft Support).  This is still very unnerving, and a prime example of why you should have good backups of the SQL database on your DPM server.
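For anyone wanting to script that kind of database backup, here is a minimal sketch in Python that builds a `sqlcmd` command line around a plain T-SQL `BACKUP DATABASE` statement.  Everything specific in it is an assumption for illustration and not necessarily what we run: the instance name `localhost\MSDPM2010`, the database name `DPMDB`, the backup folder, and the helper names `build_backup_command` and `run_backup`.  Check the names against your own DPM server before using anything like this.

```python
import subprocess
from datetime import date


def build_backup_command(server, database, backup_dir, day=None):
    """Build a sqlcmd invocation that writes a full, checksummed backup file.

    Hypothetical helper: server, database, and backup_dir are whatever your
    DPM installation actually uses.
    """
    day = day or date.today()
    target = f"{backup_dir}\\{database}_{day:%Y%m%d}.bak"
    tsql = f"BACKUP DATABASE [{database}] TO DISK = N'{target}' WITH INIT, CHECKSUM"
    # -S picks the server\instance, -E uses Windows authentication,
    # -b returns a non-zero exit code on error, -Q runs the query and exits.
    return ["sqlcmd", "-S", server, "-E", "-b", "-Q", tsql]


def run_backup(server, database, backup_dir):
    """Run the backup; raises CalledProcessError if sqlcmd reports a failure."""
    subprocess.run(build_backup_command(server, database, backup_dir), check=True)
```

Calling `run_backup(r"localhost\MSDPM2010", "DPMDB", r"E:\DpmDbBackups")` from a scheduled task would produce one dated `.bak` file per day.  Copy those files off the DPM server, since a backup that lives only on the machine that just corrupted itself is not much comfort.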