Stop Losing Fills: The FIX Session Lifecycle Checklist Every Broker Needs

David Kovač
April 19, 2026 · 7 min read

Brokers don’t usually “lose trades” because FIX is unreliable. They lose trades because the FIX session lifecycle is mis-implemented or mis-operated: heartbeats aren’t respected, sequence numbers drift, resends are mishandled, and sessions get bounced without a clean recovery plan.

This post breaks down the FIX session lifecycle—Logon/Logout, Heartbeats, and Sequence Numbers—and ties each concept to the real operational failures that cause missed fills, duplicated orders, or position mismatches between your platform, bridge, and liquidity providers (LPs).

1) FIX session lifecycle in one picture: connect, sync, prove liveness, recover

A FIX “session” is not just a TCP connection. It’s a stateful conversation where both sides must agree on:

  • Who you are (SenderCompID/TargetCompID, credentials)
  • Where you are in the message stream (MsgSeqNum)
  • Whether the other side is alive (Heartbeats/Test Requests)
  • How to recover if anything is missed (Resend Requests, Gap Fill)

When any of those drift, you can still have an open socket and still be “connected” in dashboards—while your order and execution flow is logically broken.

For brokers and prop firms, the risk isn’t theoretical:

  • You can accept a client order in MT5/cTrader, route it, and then miss the execution report that confirms the fill.
  • You can reconnect and accidentally replay old messages into your OMS/bridge.
  • You can end up with a clean LP position but a dirty internal ledger (or the reverse), which becomes a reconciliation and client-dispute problem.

2) Logon done right: negotiate state, not just credentials

A FIX Logon (MsgType=A) is where sessions are either made safe—or made fragile.

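At logon time, sequence numbers matter immediately. Each side’s Logon carries its next outbound MsgSeqNum (tag 34), and the receiver compares it with the number it expects next. A number higher than expected means messages were missed; a number lower than expected (without PossDupFlag) is a serious error that normally ends the session. When the numbers don’t line up, the session must decide whether to: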

  • Resend missing messages (safe when implemented correctly)
  • Reset sequence numbers (fast, but dangerous if you don’t reconcile)
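To make the logon fields concrete, here is a minimal sketch of assembling a FIX 4.4 Logon on the wire. The CompIDs (BROKER, LP_A), the timestamp, and the helper names are hypothetical; a production FIX engine would normally build this for you.

```python
SOH = "\x01"  # FIX field delimiter


def build_fix_message(msg_type, seq_num, sender, target, sending_time, body_fields):
    """Assemble a FIX 4.4 message, filling in BodyLength (9) and CheckSum (10)."""
    fields = [
        (35, msg_type),      # MsgType
        (49, sender),        # SenderCompID
        (56, target),        # TargetCompID
        (34, seq_num),       # MsgSeqNum
        (52, sending_time),  # SendingTime (UTC)
    ] + list(body_fields)
    body = "".join(f"{tag}={value}{SOH}" for tag, value in fields)
    # BodyLength counts the bytes after the 9=...<SOH> field up to the trailer.
    body_length = len(body.encode("ascii"))
    head = f"8=FIX.4.4{SOH}9={body_length}{SOH}"
    # CheckSum is the byte sum of everything before tag 10, modulo 256, 3 digits.
    checksum = sum((head + body).encode("ascii")) % 256
    return f"{head}{body}10={checksum:03d}{SOH}"


def build_logon(seq_num, sender, target, sending_time, heartbeat_secs, reset=False):
    """Logon (35=A) with EncryptMethod (98), HeartBtInt (108), optional 141=Y."""
    body = [(98, 0), (108, heartbeat_secs)]
    if reset:
        body.append((141, "Y"))  # ResetSeqNumFlag: both sides restart at 1
    return build_fix_message("A", seq_num, sender, target, sending_time, body)


logon = build_logon(42, "BROKER", "LP_A", "20260419-09:30:00.000", 30)
print(logon.replace(SOH, "|"))
```

Note that MsgSeqNum travels in the Logon itself, which is why a stale or forgotten counter shows up at the very first message.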

Practical logon checklist for brokers:

  • Persist sequence numbers across restarts. If your FIX engine “forgets” MsgSeqNum on reboot, you’ll force resets or resends that can corrupt state.
  • Treat ResetSeqNumFlag as an operational event, not a convenience. Sequence resets can be valid (e.g., start-of-day policies), but they must be coordinated and logged.
  • Fail fast on CompID / credential mismatches. Misrouted sessions (wrong TargetCompID) can lead to silent rejects or messages going to the wrong venue configuration.
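The first checklist item can be sketched as a small file-backed store. This is an illustration under assumptions, not how any particular FIX engine does it: the class name, JSON layout, and one-file-per-session design are hypothetical. The point it shows is that the counters are written to disk (write to a temp file, fsync, atomic rename) before the sequence number is used.

```python
import json
import os
import tempfile


class SequenceStore:
    """Hypothetical per-session store for next outbound/inbound MsgSeqNum."""

    def __init__(self, path):
        self.path = path
        self.next_out = 1
        self.next_in = 1
        if os.path.exists(path):
            # Restart case: pick up where the previous process left off.
            with open(path, "r", encoding="utf-8") as fh:
                state = json.load(fh)
            self.next_out = state["next_out"]
            self.next_in = state["next_in"]

    def _save(self):
        # Write to a temp file in the same directory, then rename atomically,
        # so a crash never leaves a half-written state file behind.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".seq-")
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump({"next_out": self.next_out, "next_in": self.next_in}, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, self.path)

    def take_outbound(self):
        """Reserve the next outbound sequence number and persist the counter."""
        seq = self.next_out
        self.next_out += 1
        self._save()
        return seq

    def record_inbound(self):
        """Advance the expected inbound number after a message is processed."""
        self.next_in += 1
        self._save()
```

An fsync per message is slow; real engines usually batch or use a journal, so treat the write strategy here as a placeholder for whatever durability your stack provides.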

Operationally, the biggest logon mistake is “green light equals safe.” A successful logon only means you authenticated and agreed to talk—not that your message streams are synchronized.

3) Logout and disconnects: the difference between clean shutdown and state corruption

Logout (MsgType=5) is where many environments get sloppy—especially during maintenance windows, LP failovers, or bridge restarts.

A clean logout helps you:

  • stop sending new business messages
  • flush any buffered messages
  • record the final sequence state

The failure mode is common: systems drop TCP without a proper Logout, then reconnect and either:

  • request a resend flood that overwhelms downstream components, or
  • reset sequence numbers to “get back online,” masking the fact that executions may be missing.

What to enforce in ops runbooks:

  • Planned restarts require a controlled logout. If you’re restarting a bridge, FIX gateway, or risk engine, treat it like a change-managed event.
  • Unplanned disconnects require an explicit recovery path. Don’t let teams “just reconnect” until you’ve decided: resend vs reset, and how you will reconcile.
  • Always capture session transcripts. For disputes, you need message-level evidence (with timestamps) of what was sent/received.

From a compliance perspective, keeping robust logs supports best-practice recordkeeping and auditability—always check local regulations and your liquidity agreements for retention and reporting requirements.

4) Heartbeats & Test Requests: liveness is not latency

Heartbeats (MsgType=0) exist to prove the session is alive when no application messages are flowing. Test Request (MsgType=1) is the “are you there?” nudge when heartbeats are late: the counterparty must answer with a Heartbeat that echoes the TestReqID (tag 112).

Two practical misunderstandings cause real losses:

  1. Heartbeat timeouts are treated as “network noise.” If you ignore repeated Test Requests or heartbeat misses, you can be in a half-dead state where one side is sending but the other isn’t processing.

  2. Heartbeat intervals are set without considering GC pauses, CPU spikes, or downstream backpressure. In busy market conditions, your process might be alive but unable to respond quickly.

Broker-grade heartbeat checklist:

  • Set heartbeat interval based on realistic worst-case processing, not “what works in UAT.”
  • Alert on Test Requests and missed heartbeats as early indicators of processing stalls.
  • Separate network health from application health. A stable ping doesn’t mean your FIX engine threads aren’t blocked.

Why this ties to “lost trades”: a heartbeat timeout often triggers a disconnect/reconnect cycle. If your resend logic is weak, the reconnect is where you lose (or duplicate) executions.
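The liveness rules above can be sketched as a small state machine. The class name, the action strings, and the 2-second grace period are assumptions for illustration; the overall shape (wait one interval plus some slack, send a Test Request, then disconnect if that also goes unanswered) follows common FIX engine behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LivenessMonitor:
    """Hypothetical inbound-liveness tracker for one FIX session."""

    heartbeat_secs: float             # HeartBtInt (tag 108) agreed at logon
    grace_secs: float = 2.0           # assumed slack for transmission delay
    last_received: float = 0.0        # time any inbound message last arrived
    test_request_sent_at: Optional[float] = None

    def on_message(self, now):
        # Any inbound message (not only Heartbeat) proves the peer is alive.
        self.last_received = now
        self.test_request_sent_at = None

    def check(self, now):
        """Return the action the session should take at time `now`."""
        limit = self.heartbeat_secs + self.grace_secs
        if self.test_request_sent_at is not None:
            # A Test Request is outstanding: give it one more interval.
            if now - self.test_request_sent_at > limit:
                return "DISCONNECT"
            return "WAIT"
        if now - self.last_received > limit:
            self.test_request_sent_at = now
            return "SEND_TEST_REQUEST"
        return "OK"


monitor = LivenessMonitor(heartbeat_secs=30)
print(monitor.check(10))   # quiet for 10s: nothing to do yet
print(monitor.check(33))   # quiet for 33s: time to probe the peer
```

Counting every "SEND_TEST_REQUEST" result as a metric gives you the early-warning alert from the checklist without waiting for an actual disconnect.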

5) Sequence numbers & resends: where most ‘lost fill’ incidents actually start

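MsgSeqNum (tag 34) is the spine of FIX reliability. Each side keeps its own outbound counter and increments it by one for every message it sends, administrative messages included. When the receiver sees a number higher than the one it expects, it has detected a gap and asks for the missing range using Resend Request (MsgType=2).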

This is the critical point: resend is normal. What’s abnormal is mishandling it.
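As a rough sketch, the receiver's decision for each inbound message looks like this. The function name and action labels are hypothetical; the three cases (in order, too high, too low) are the standard ones.

```python
def classify_inbound(expected, received, poss_dup=False):
    """Decide what to do with an inbound MsgSeqNum (tag 34).

    Returns (action, missing_range), where missing_range is the inclusive
    (begin, end) span to request, or None when nothing is missing.
    """
    if received == expected:
        return ("PROCESS", None)
    if received > expected:
        # Gap: messages expected..received-1 never arrived.
        return ("RESEND_REQUEST", (expected, received - 1))
    if poss_dup:
        # Lower than expected but flagged PossDupFlag (43)=Y: a replay of
        # something already processed, safe to drop.
        return ("IGNORE_DUPLICATE", None)
    # Lower than expected with no PossDupFlag: the streams have diverged.
    return ("FATAL_TOO_LOW", None)


print(classify_inbound(expected=10, received=15))
```

Many engines send the Resend Request with EndSeqNo (tag 16) set to 0, meaning "everything from BeginSeqNo onward", rather than the exact end of the gap; check which form your counterparty expects.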

Common broker failure patterns:

  • Gap detected → immediate sequence reset. Fastest way to get “connected,” but you may permanently skip executions you never processed.
  • Resend flood → downstream overload. Your OMS/bridge/risk layer isn’t idempotent and can’t handle replays.
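  • Duplicate processing of Execution Reports. Resent messages arrive with PossDupFlag (tag 43) set to Y, but your trade capture still has to dedupe on its own. Key on ExecID (tag 17), which identifies the individual execution; ClOrdID and OrderID identify the order, so partial fills share them and they cannot tell one fill from another. Without that check, you can book the same fill twice.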
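One way to guard against the duplicate-booking pattern is to key every fill on the session plus ExecID before it touches positions. The sketch below is illustrative only: the class, the in-memory set, and the integer quantities are assumptions, and a real ledger would keep the seen-keys in the same durable transaction as the booking.

```python
class FillLedger:
    """Hypothetical idempotent trade capture keyed on (session, ExecID)."""

    def __init__(self):
        self._seen = set()    # (session_id, exec_id) pairs already booked
        self.positions = {}   # symbol -> net quantity in units

    def apply_fill(self, session_id, exec_id, symbol, side, qty):
        """Book a fill once. Returns True if booked, False if it was a replay."""
        key = (session_id, exec_id)
        if key in self._seen:
            return False      # replayed Execution Report: no position change
        self._seen.add(key)
        signed = qty if side == "BUY" else -qty
        self.positions[symbol] = self.positions.get(symbol, 0) + signed
        return True


ledger = FillLedger()
ledger.apply_fill("LP_A", "E-1001", "EURUSD", "BUY", 100_000)
ledger.apply_fill("LP_A", "E-1001", "EURUSD", "BUY", 100_000)  # resend replay
print(ledger.positions)
```

With this in place, a resend flood becomes a performance concern rather than a correctness one.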

Implementation guardrails (practical, not academic):

  • Make trade capture idempotent. Your internal ledger should safely handle replays.
  • Persist inbound/outbound sequence state to durable storage. Memory-only state is a time bomb.
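  • Support Gap Fill correctly. When resending, FIX engines can send Sequence Reset (MsgType=4) with GapFillFlag (tag 123) set to Y to skip ranges that will not be resent, typically administrative messages; NewSeqNo (tag 36) gives the next sequence number to expect. Mishandling this creates phantom gaps.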
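Gap Fill handling on the receiving side can be sketched as follows. This is a simplified illustration: the function name and action labels are hypothetical, it covers only Sequence Reset in gap-fill mode (GapFillFlag set to Y), and it assumes any gap fill with a sequence number below the expected one is a harmless replay.

```python
def apply_gap_fill(expected, msg_seq_num, new_seq_no):
    """Handle Sequence Reset (35=4) in gap-fill mode.

    expected    -- next inbound MsgSeqNum we are waiting for
    msg_seq_num -- MsgSeqNum (34) of the Sequence Reset itself
    new_seq_no  -- NewSeqNo (36): the next number the sender will use

    Returns (action, next_expected).
    """
    if msg_seq_num > expected:
        # The gap fill itself arrived out of order: still missing messages.
        return ("RESEND_REQUEST", expected)
    if msg_seq_num < expected:
        # Assumed to be a replay of a range we already moved past.
        return ("IGNORE", expected)
    if new_seq_no <= msg_seq_num:
        # A gap fill may only move the sequence forward.
        return ("REJECT", expected)
    # Skip the filled range: messages msg_seq_num..new_seq_no-1 will not come.
    return ("ADVANCE", new_seq_no)


print(apply_gap_fill(expected=10, msg_seq_num=10, new_seq_no=15))
```

Forgetting the "ADVANCE" step is what produces phantom gaps: the receiver keeps waiting for sequence numbers the sender has explicitly said it will never send.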

If you’re running multiple venues/LPs, treat each FIX session as its own reliability domain. A clean session with LP-A doesn’t protect you from a broken session with LP-B.

6) A broker’s incident playbook: diagnose, recover, reconcile

When something goes wrong, the goal is not “restore connectivity.” The goal is to restore correctness.

Use this incident flow (works for brokers, prop firms, and liquidity bridges):

  • Step 1: Freeze risk. If you suspect missing executions, consider halting new routing (or switching to a controlled fallback) until state is confirmed.
  • Step 2: Identify the first sequence gap. Find the exact MsgSeqNum where divergence started.
  • Step 3: Choose recovery method.
    • Prefer Resend Request when you trust both sides’ logs and your system is replay-safe.
    • Use sequence reset only with a documented reconciliation step.
  • Step 4: Reconcile positions and cash. Compare:
    • LP statements / drop copy (if available)
    • bridge/aggregator fills
    • platform positions (MT5/cTrader)
    • internal ledger (CRM/BO)
  • Step 5: Post-incident hardening. Add alerts, improve dedupe, adjust heartbeat intervals, and tighten restart procedures.
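Step 4 can be sketched as a comparison of net positions per symbol across every source. The source names, symbols, and quantities below are made up for illustration, and quantities are kept as integer units so that equality checks are exact.

```python
def reconcile(sources):
    """Find symbols whose net position differs between sources.

    sources -- dict of source name -> dict of symbol -> net quantity
    Returns a dict of symbol -> {source name: quantity} for every break.
    """
    symbols = set()
    for positions in sources.values():
        symbols.update(positions)
    breaks = {}
    for symbol in sorted(symbols):
        # A source with no entry for the symbol is treated as flat (0).
        view = {name: positions.get(symbol, 0) for name, positions in sources.items()}
        if len(set(view.values())) > 1:
            breaks[symbol] = view
    return breaks


breaks = reconcile({
    "lp_statement": {"EURUSD": 300_000, "XAUUSD": 10},
    "bridge":       {"EURUSD": 300_000, "XAUUSD": 10},
    "platform":     {"EURUSD": 200_000, "XAUUSD": 10},
})
print(breaks)
```

Here the platform is 100,000 units short on EURUSD relative to the LP and the bridge, which points at a fill that was executed but never booked downstream.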

Two controls that pay off quickly:

  • Real-time execution reconciliation (LP vs bridge vs platform) to catch missing fills within minutes, not hours.
  • Session health dashboards that track: last inbound/outbound seq, resend counts, test request spikes, logout reasons.

The Bottom Line

The FIX session lifecycle is a reliability system: logon/logout defines state, heartbeats prove liveness, and sequence numbers enforce completeness.

Most “lost trade” incidents are really recovery failures—gaps, resends, or resets handled without idempotency and reconciliation.

If you run broker or prop infrastructure, treat FIX session ops as part of your risk framework, document your recovery playbooks, and align logging with your compliance obligations.

If you want help designing FIX connectivity that stays correct under stress, talk to Brokeret at /get-started.
