Last week I dealt with a very interesting and unusual SCCM failure. It has started with a link failure error between CAS and one of the primary sites. When I ran Replication Link Analyser the first error message was “SQL Server Broker login is missing for sites: <my primary site code>”. After that RLA informed me that the login is recreated but in fact it was not and the issue was still there. I was also unable to find any information on how to recreate SQL Server Broker or at least what it is.
After more rigorous search I have found the following SQL command which shows you SQL replication status in real time.
Use CM_CAS; Select * from sys.transmission_queue
The content of that table should change dynamically. In my case there was a bunch of stalled messages with ConfigMgr_Site<My Primary Site Code (PSC)> in to_service_name column and “Connection attempt failed with error: 10060” in transmission_status column. That gave me clear indication that the Service Broker transmission is broken between my CAS and PSS.
Note: when the transmission is resumed SQL should clear up stuck messages however sometimes you might need to clear them up yourself using update sys.transmission_queue Also please note that any intervention to the SCCM database is not supported by Microsoft.
After several telnet tests I have figured out that Service Broker is not responding or listening on PSS server database.
In our environment all SQL servers are shared hosts so all Service Brokers are using private ports. To identify the port used by Service Broker run the following SQL script on your SQL instance
Use CM_CAS select port from tcp_endpoints where type_desc like ‘%SERVICE_BROKER%’
Note that there could be only one Service Broker endpoint per database.
I have executed the query above and the result was nil. That gave me an understanding that somehow my Service Broker was deleted on the database.
At that stage I was about to give up. There is a script which creates Service Broker endpoint but I know that SCCM secures all internal communications with certificates had no idea which certificate to use. I’ve been thinking to either call Microsoft or reinstall the site (including several role servers) but fortunately I have found the required script on Internet.
CREATE ENDPOINT [ConfigMgrEndpoint]
STATE=STARTED
AS TCP (LISTENER_PORT = <my port>, LISTENER_IP = ALL)
FOR SERVICE_BROKER (MESSAGE_FORWARDING = ENABLED
,
MESSAGE_FORWARD_SIZE = 5
, AUTHENTICATION = CERTIFICATE
[ConfigMgrEndpointCert]
, ENCRYPTION = REQUIRED ALGORITHM AES)
GO
All good but how would I know what private port was used by my missing Service Broker? In SQL Management Studio go to CM_CAS\Service Broker\Routes\ConfigMgrDRSSiteRoute_<your PSC>, click on properties and in properties check for Address which would look like TCP://<your PSS FQDN>:<port>.
Once I’ve executed the SQL script above my telnet test has succeeded. I ran RLA again and it gave me “SQL Server Broker login is missing for sites: <my primary site code>” error again but this time it’s succeeded in fixing this issue and this error has not re-appear again.
I thought now it’s just a matter of time but in several hours I still saw no activity in rcmctrl.log. However all error messages from sys.transmission_queue have gone.
So I had another look at the link status, specifically at Initialization Detail tab. It is very important to look at it from both sides, i.e. both CAS and PSS. On PSS side I’ve noticed that one replication group has stuck at 1% replicating up to CAS.
There is a way to reset replication group. You have to create <replication group name>.pub file and place it to the rcm.box inbox. This file should disappear in 5-10 seconds. If it does not disappear at all it clearly indicates that the issue is on another end. Delete it and try from another side.
Once I’ve dropped the PUB file to the rcm.box it has pulled the plug. I have started to see replication activity in rcmctrl.log and file exchange in rcm.box. The issue has gone in hour and a half.
Several important things to remember when you have SCCM 2012 replication issue:
- Start your troubleshooting with RLA.
- If the primary site sits in link failure state for substantial amount of time SCCM puts the primary site in read only mode and the link in the maintenance state
- If the issue is not fixed SCCM will also put CAS database in maintenance mode, consequently the rest of links will fail.
- Check sys.transmission_queue for stuck transactions. The content of this table must rapidly change.
- Check rcmctrl.log for any activity.
- Identify your Service Broker ports and run telnet connectivity tests.
- Check CM_<site code>\Service Broker\Queues if any queue is down.
- The easiest way to reset replication group replication is to drop <replication group name>.pub file to the rcm.box inbox. Note that the PUB file name should be <replication group name>.pub on PSS, on CAS it should be <replication group name>-<primary site code>.pub. The file should disappear in 5-10 seconds. If it does not, the issue is on another end. Delete the file and try on another end.