In this blog post, I would like to discuss a peculiar issue that I ran into. I am writing it up for a couple of reasons.
a) Microsoft worked on this case for 2-3 days with all the necessary logs right in front of them, but could not nail the issue. (Not criticizing. They are human too, I get it.)
b) It may be a rare scenario for the world, but it will be a very common one in Asia, especially in countries that implement strict data laws, where the same company has multiple SMTP domains ending in country codes and a separate DAG for each.
Here is the setup in a nutshell.
a) 3 DAGs: Singapore [SG-DAG], Hong Kong [HK-DAG] and Indonesia [ID-DAG].
b) Hybrid setup done with Exchange Online.
c) Mailboxes from SG and HK are being migrated. ID is yet to start.
d) Each location had an MRS endpoint, and Autodiscover was published.
e) A new MRS endpoint was introduced for ID, and Autodiscover for the ID-specific email domain was published on the internet.
f) While migrating mailboxes to the cloud, we choose the endpoint of the location where the mailboxes are hosted and migrate them through it.
Everything was working as expected; mailboxes hosted in SG/HK were being migrated without any issues.
Introducing the issue:
When Autodiscover and the MRS endpoint for ID were published, we observed a sudden growth in MPLS usage between SG/HK and ID. All the bandwidth between these locations was consumed, and the culprits were the Exchange servers in these locations talking to each other on TCP port 444, which showed up as SNPP.
No mailbox was being migrated or synced to EXO, so why should this be the case? What is the SNPP protocol, and why were the Exchange servers in both locations replicating so much data? If mailboxes were being migrated, they would be using the MRS Proxy endpoint that is published in Indonesia.
We raised a case with Microsoft and shared the logs with them. It took them a while, and they probably saw Exchange ActiveSync traffic causing it, but they could not pinpoint what had changed all of a sudden to cause this surge. We had not changed anything on Exchange apart from publishing the Autodiscover URL for the Indonesia SMTP domain and the MRS endpoint for the Indonesia Exchange servers.
I was travelling and could not focus on the issue. Once I landed in Singapore, I spoke to my colleagues, who shared a key piece of information and a log that helped narrow down what the issue might be.
The clue came from the IIS logs on the Exchange servers in Indonesia. We saw a lot of POST events in those logs.
Log Location: C:\inetpub\logs\LogFiles\W3SVC2
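The log shows POST requests from the Singapore server on port 444 for DeviceType=OutlookService, with OutlookServiceMrsAgent being used on the Indonesia server. This was logged for about 27 users in Indonesia. As for the "SNPP" label: network tools show port 444 under its registered service name, but on Exchange 2013 and later it is the HTTPS port of the Exchange Back End IIS site, so what looked like SNPP was client requests being proxied from one Exchange server to another.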
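If you want to check your own servers for the same pattern, here is a minimal Python sketch of the kind of log triage we did by eye. It assumes the IIS logs are in the default W3C format with a "#Fields:" header line, and that the ActiveSync query string carries User and DeviceType parameters. The function name count_outlookservice_posts, the sample log lines, the IP addresses and the user names are all made up for illustration; they are not taken from our environment.

```python
from collections import Counter
from urllib.parse import parse_qs


def count_outlookservice_posts(lines, port="444", device_type="OutlookService"):
    """Count POST requests per (user, client IP) for one ActiveSync device type.

    `lines` is any iterable of W3C log lines, e.g. an open IIS log file.
    """
    fields = []
    counts = Counter()
    for line in lines:
        line = line.strip()
        # The "#Fields:" directive names the columns of the rows that follow.
        if line.startswith("#Fields:"):
            fields = line[len("#Fields:"):].split()
            continue
        # Skip blank lines, other directives, and rows before any header.
        if not line or line.startswith("#") or not fields:
            continue
        values = line.split(" ")
        if len(values) != len(fields):
            continue  # malformed or truncated row
        row = dict(zip(fields, values))
        if row.get("cs-method") != "POST" or row.get("s-port") != port:
            continue
        query = parse_qs(row.get("cs-uri-query", ""))
        if device_type not in query.get("DeviceType", []):
            continue
        user = query.get("User", ["-"])[0]
        counts[(user, row.get("c-ip", "-"))] += 1
    return counts


# Invented sample data, shaped like a W3C log but NOT real log output.
SAMPLE = """\
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port c-ip sc-status
2000-01-01 00:00:01 10.3.0.10 POST /Microsoft-Server-ActiveSync/default.eas Cmd=Sync&User=user1%40example.id&DeviceId=AAA&DeviceType=OutlookService 444 10.1.0.10 200
2000-01-01 00:00:02 10.3.0.10 POST /Microsoft-Server-ActiveSync/default.eas Cmd=Sync&User=user1%40example.id&DeviceId=AAA&DeviceType=OutlookService 444 10.1.0.10 200
2000-01-01 00:00:03 10.3.0.10 POST /Microsoft-Server-ActiveSync/default.eas Cmd=Sync&User=user2%40example.id&DeviceId=BBB&DeviceType=OutlookService 444 10.1.0.10 200
2000-01-01 00:00:04 10.3.0.10 POST /Microsoft-Server-ActiveSync/default.eas Cmd=Sync&User=user3%40example.id&DeviceId=CCC&DeviceType=iPhone 443 10.9.0.5 200
2000-01-01 00:00:05 10.3.0.10 GET /owa - 444 10.1.0.10 200
""".splitlines()

if __name__ == "__main__":
    for (user, client_ip), hits in count_outlookservice_posts(SAMPLE).most_common():
        print(f"{user} from {client_ip}: {hits} POSTs")
```

To run it against a real log, pass an open file instead of SAMPLE, for example open(path, encoding="utf-8", errors="replace"). A long list of users all coming from a server IP in another site would match what we saw.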
Other findings from one of my colleagues, @Praveen_C_k, showed a huge amount of data flow, which matched what we saw in the IIS logs.
It then struck me that this was possibly caused by using Outlook for iOS/Android with hybrid Exchange on-premises and EMS. We had enrolled in the TAP program with Microsoft and had set up the ability for users in our company to access their email via the Outlook app on their mobile phones.
As part of this configuration, you allow a service in the Mobile Device Access Policy on the on-premises Exchange server by the name “Device Type=OutlookService”. This lets your users access their on-premises mailboxes securely from Outlook on mobile devices, with Intune App Protection policies applied to the Outlook app.
However, the catch is that Microsoft caches 4 weeks of mail data for these users in Exchange Online. This is very important to note. Microsoft does that because they then apply the EMS policies to this data cache, as shown below. What users access from the mobile app is this cache. The caching mechanism uses the MRS Proxy endpoint on the on-premises Exchange server. Here’s the deal though: you cannot specify multiple MRS Proxy endpoints for this caching mechanism. I believe it is one per hybrid configuration, and that’s the issue we have.
With mailbox migrations, you have the freedom to specify the MRS Proxy endpoint for a specific batch. In our case, we can point to the Indonesia MRS endpoint that we have published when we migrate mailboxes hosted there. For the user data cache, however, we don't have such a mechanism, so the cache for Indonesia users was being built through an endpoint in another location and pulled across the MPLS links. I will work with Microsoft on this via our TAM and CxP colleagues.
To bring the utilization down in the meantime, we did the following.
a) Since the utilization was causing issues with other services, we edited the Mobile Device Access Policy to quarantine devices for “Device Type=OutlookService”. We are migrating to Exchange Online anyway and can live without Outlook, as we also have alternatives in AirWatch/Good/BB.
b) Restarted the Microsoft Exchange Mailbox Replication service.
c) Reset IIS. (This is a must, otherwise the traffic won't go down, as the sessions keep going until you reset IIS.)
Once we reset IIS, we saw the network utilization drop.