Monday, January 12, 2009

BizTalk 2006 - Missing Tracking Data (HAT and BAM)

I was reluctant to publish this post, but I decided to because I hope it saves some people time and frustration. However, I won't be able to sleep at night without posting the following disclaimer:

This post describes invasive techniques that will remove data from your BizTalk databases. Some of these procedures may not be supported by Microsoft in production environments. When in doubt, please contact Microsoft Premier Support. Proceed at your own risk.

In our Test environment we were experiencing missing BAM data. The processes that BAM was tracking were completing successfully. We tried to reproduce the problem by deploying the BizTalk project and BAM artifacts to other servers, but could not. That ruled out a coding issue, especially since this project had been live for some time and a second iteration was just being implemented.

Knowing that the issue was not code related forced us to focus on the current infrastructure. Since I was not on this project, I started running some tests hoping to reproduce the problem. This turned out not to be difficult, as submitting as few as 8 messages to BizTalk would create some anomalies.

Anomaly #1 - Missing HAT Data
While trying to reproduce the problem, I noticed that we were missing some Tracking data related to HAT.

We would see instances that had started and that "apparently" had not completed. When stepping through the orchestration debugger there were some common shapes where the tracking would just stop.

However, within this process we have several log points where we write an entry to the event log. We are also communicating with a few different systems and were able to determine that these downstream systems were receiving messages from BizTalk. In other words, the orchestrations were running to completion even though HAT showed them stopping partway.

Anomaly #2 - Missing BAM Data
Within the BAM Portal we would find columns that did not contain any information. We are using milestones, so we could see that the process was completing, yet some data was missing. Within this process we are using both the BAM API and the Tracking Profile Editor (TPE). TPE only tracks data that hits ports, whereas we had some information that never hits a port, which forced us to use the BAM API.

What did not add up was that, for an instance that had completed and had missing data, the Expression shape containing the BAM API call was being executed.

At this point I thought that the two anomalies could be related, since we are taking advantage of the OrchestrationEventStream within the API. Darren Jefford has a great explanation of the OrchestrationEventStream, so I am not going to try to come up with something better here.

After some further investigation I was not getting too far, so I figured it was time to reach out for some technical support. What I was told was that "the TDDS sequence numbers in TDDS_StreamStatus tables and TrackingData tables have gotten out of sync." When this happens, the TDDS subservice discards tracking data. This definitely explains the missing data in HAT.

The previous description is more directly related to the missing HAT data. Since the BAM data is essentially tracking data and, because of the OrchestrationEventStream, follows the same process as HAT data, it was plausible that both anomalies shared the same cause.
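
To make the support explanation a little more concrete, here is a toy model in Python of how a stored "last sequence number" that has drifted ahead of the real data could cause new records to be thrown away. This is only my mental picture of the problem: the class, field names, and comparison rule are assumptions for illustration and do not mirror the real TDDS code or table schema.

```python
# Toy model only: the class and field names below are hypothetical and do not
# mirror the real TDDS implementation or the TDDS_StreamStatus schema.

class StreamStatus:
    """Remembers the last sequence number processed for each stream."""

    def __init__(self):
        self.last_seq = {}  # stream id -> last processed sequence number

    def accept(self, stream_id, seq):
        """Return True if the record is processed, False if it is discarded."""
        if seq <= self.last_seq.get(stream_id, 0):
            return False  # looks like already-processed data, so it is dropped
        self.last_seq[stream_id] = seq
        return True

    def truncate(self):
        """Forget all stored positions, like emptying the status table."""
        self.last_seq.clear()


status = StreamStatus()
status.last_seq["stream-1"] = 1000  # stored position is ahead of the real data

incoming = [1, 2, 3]  # new tracking records numbered from a lower point
before = [status.accept("stream-1", s) for s in incoming]

status.truncate()  # reset the stored positions
after = [status.accept("stream-1", s) for s in incoming]

print(before)
print(after)
```

In this model all three records are discarded while the stored position is out of sync, and all three are accepted once the stored positions have been cleared.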

Here are the steps that we performed to solve our issue:
(Disclaimer in effect - Please exercise caution before running these steps. There is no way to be 100% sure that your issue is the same as mine. If you are unsure, contact Microsoft.)
  • Stop all BizTalk Host Instances
  • Stop SQL Agent (These two steps ensure that nothing is connecting to the databases)
  • Ensure that you have no Active, Dehydrated, or Ready to Run instances in BizTalk Admin
  • Verify that the TrackingData table has a row count of zero (If it does not, drain the remaining messages by starting your tracking host back up, then stop it again once the row count reaches zero)
  • Back up the Management, MsgBox, DTADb (Tracking), and BAMPrimaryImport databases
  • Truncate the following tables: BizTalkDTADB.TDDS_StreamStatus, BAMPrimaryImport.TDDS_StreamStatus
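
The database side of the steps above can be sketched as a small Python helper that only builds the T-SQL statements as text, so they can be reviewed before anyone runs them. This is a sketch, not the exact script we used: the database names (BizTalkMgmtDb, BizTalkMsgBoxDb, BizTalkDTADb, BAMPrimaryImport), the dbo schema, the assumption that TrackingData lives in the MessageBox database under that name, and the backup folder are all assumptions based on a default install and need to be checked against your own environment.

```python
# Sketch only: the database names, dbo schema, and backup folder below are
# assumptions (default install names), not values taken from this post.

BACKUP_DBS = ["BizTalkMgmtDb", "BizTalkMsgBoxDb", "BizTalkDTADb", "BAMPrimaryImport"]
STATUS_DBS = ["BizTalkDTADb", "BAMPrimaryImport"]


def row_count_check(msgbox_db="BizTalkMsgBoxDb"):
    """Query whose result must be zero before continuing."""
    return f"SELECT COUNT(*) FROM [{msgbox_db}].[dbo].[TrackingData];"


def backup_statements(backup_dir):
    """One full backup per database, written to backup_dir."""
    return [
        f"BACKUP DATABASE [{db}] TO DISK = N'{backup_dir}\\{db}.bak';"
        for db in BACKUP_DBS
    ]


def truncate_statements():
    """Empty TDDS_StreamStatus in the tracking and BAM import databases."""
    return [f"TRUNCATE TABLE [{db}].[dbo].[TDDS_StreamStatus];" for db in STATUS_DBS]


def build_script(backup_dir):
    """Return the statements in the same order as the steps in the list."""
    lines = ["-- Run only after all host instances and SQL Agent are stopped."]
    lines.append(row_count_check())
    lines.extend(backup_statements(backup_dir))
    lines.extend(truncate_statements())
    return lines


if __name__ == "__main__":
    for line in build_script(r"D:\Backups"):  # hypothetical backup folder
        print(line)
```

Generating the statements as plain text keeps the destructive part (the two TRUNCATE TABLE statements) visible and last, after the row count check and the backups.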

After performing these actions, we were set to run some tests to try to reproduce the issue. After running several scenarios and putting greater-than-production loads on the environment, we still could not reproduce the issue.

What is still concerning about this issue is how it originally happened. There were no obvious changes to our environment that prompted it. We were also lucky that it occurred in the Test environment, as rebuilding the environment was on the table at one point.
