Project 13: UDP Transmitter (MII TX)

Part of FPGA Trading Systems Portfolio

This project is part of a complete end-to-end trading system:

Main Repository: fpga-trading-systems
Project Number: 13 of 30
Category: FPGA Core
Dependencies: Project 8 (Order Book)

Overview

Hardware-accelerated order book implementation for high-frequency trading systems with UDP BBO transmission. Processes ITCH 5.0 market data messages in real time, maintains order storage and price level aggregation, and tracks Best Bid/Offer (BBO) with sub-microsecond latency. BBO updates are transmitted via UDP to 192.168.0.93:5000 (the default destination) for low-latency market data distribution, while UART carries runtime configuration commands and debug messages.

Trading Context: Order books are the fundamental data structure in electronic trading systems. Hardware implementation delivers deterministic latency and eliminates software stack overhead—critical advantages where microseconds directly impact profitability. UDP transmission enables real-time BBO distribution to trading algorithms with minimal overhead.

Status

Project Status: Functional - Multi-symbol order book with 8 parallel order books and spread calculation

Hardware Status: Synthesized, Programmed, and Verified on Arty A7-100T

Development Status: Integration and performance testing ongoing

Key Achievements:

Multi-symbol support: 8 parallel order books (AAPL, TSLA, SPY, QQQ, GOOGL, MSFT, AMZN, NVDA)
Round-robin BBO arbiter: Cycles through symbols with change detection
Spread calculation: Correctly calculates ask - bid for risk management
BRAM-based order storage: 1024 orders × 130 bits per symbol (32 RAMB36 tiles total)
BRAM-based price level table: 256 levels × 82 bits per symbol
Real-time BBO tracking with FSM scanner per symbol
ITCH message integration (A, E, X, D, U message types)
Production-grade BRAM inference (not LUTRAM)
Comprehensive debug infrastructure

Hardware Requirements

Board: Digilent Arty A7-100T Development Board
FPGA: Xilinx Artix-7 XC7A100T-1CSG324C
PHY: TI DP83848J 10/100 Ethernet (MII interface)
Tools: AMD Vivado Design Suite 2025.1

Features Implemented

UDP BBO Transmission

Real-time BBO Distribution (bbo_udp_formatter.vhd + eth_udp_send_wrapper.sv):

Transmits BBO updates via UDP when order book changes
Dynamic Configuration: Destination IP, MAC, and port configurable via UART commands (see UART_CONFIG_GUIDE.md)
Default Destination: 192.168.0.93:5000 (broadcast MAC: FF:FF:FF:FF:FF:FF)
Source: 192.168.0.212:5000 (FPGA MAC: 00:18:3E:04:5D:E7)
Payload: 256 bytes (28 bytes BBO data + 228 bytes padding)
UART used for dynamic configuration and debug messages

UART Configuration Commands

Dynamic Destination Setup (uart_config.vhd + uart_rx.vhd):

Configure UDP destination without reprogramming FPGA
ASCII command protocol via UART (115200 baud, 8N1)
Commands take effect immediately

Supported Commands:

IP:192.168.0.93\n       - Set destination IP address
MAC:FF:FF:FF:FF:FF:FF\n - Set destination MAC address
PORT:5000\n             - Set destination UDP port

Example Usage:

# Configure for C++ gateway on 192.168.0.100
IP:192.168.0.100
MAC:FF:FF:FF:FF:FF:FF
PORT:5000

Complete Documentation: See UART_CONFIG_GUIDE.md for detailed usage, testing procedures, and troubleshooting.
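
As a rough illustration of the command protocol above, the following Python sketch formats the three commands as newline-terminated ASCII. The helper name build_config_commands is hypothetical, and upper-casing the MAC is an assumption based on the examples shown. Sending the bytes would additionally need a serial library opened at 115200 baud, 8N1, which is not shown here.

```python
import string


def build_config_commands(ip, mac, port):
    """Return the IP/MAC/PORT commands as newline-terminated ASCII byte strings."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError(f"invalid IPv4 address: {ip!r}")

    groups = mac.split(":")
    if len(groups) != 6 or not all(
        len(g) == 2 and all(c in string.hexdigits for c in g) for g in groups
    ):
        raise ValueError(f"invalid MAC address: {mac!r}")

    if not 0 <= port <= 65535:
        raise ValueError(f"invalid UDP port: {port!r}")

    # One command per line, in the order used in the example above
    lines = [f"IP:{ip}", f"MAC:{mac.upper()}", f"PORT:{port}"]
    return [(line + "\n").encode("ascii") for line in lines]


for cmd in build_config_commands("192.168.0.100", "FF:FF:FF:FF:FF:FF", 5000):
    print(cmd)
```

Validating on the host side keeps malformed strings off the wire; how the FPGA reacts to a malformed command is described in UART_CONFIG_GUIDE.md, not here.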

UDP Packet Format:

Offset	Size (bytes)	Field	Description	Example Value
Ethernet Header (14 bytes)
0x00	6	Destination MAC	Broadcast address	`FF:FF:FF:FF:FF:FF`
0x06	6	Source MAC	FPGA MAC address	`00:18:3E:04:5D:E7`
0x0C	2	EtherType	IPv4	`0x0800`
IP Header (20 bytes)
0x0E	1	Version/IHL	IPv4, 20-byte header	`0x45`
0x10	2	Total Length	IP + UDP + Payload	`0x011C` (284 bytes)
0x17	1	Protocol	UDP	`0x11` (17)
0x1A	4	Source IP	FPGA IP address	`192.168.0.212`
0x1E	4	Destination IP	Target IP	`192.168.0.93`
UDP Header (8 bytes)
0x22	2	Source Port	FPGA UDP port	`0x1388` (5000)
0x24	2	Destination Port	Target UDP port	`0x1388` (5000)
0x26	2	Length	UDP header + payload	`0x0108` (264 bytes)
0x28	2	Checksum	Not computed	`0x0000`
UDP Payload (256 bytes)
0x2A - 0x10D	228	Padding	Zero padding	`0x00...`
BBO Data (28 bytes, at end of payload due to nibble reversal)
0x10E - 0x111	4	Spread	Ask - Bid (big-endian)	`0x000061A8` = 25,000
0x112 - 0x115	4	ASK Shares	Total ask shares (big-endian)	`0x0000012C` = 300
0x116 - 0x119	4	ASK Price	Best ask price (big-endian)	`0x00173180` = 1,520,000
0x11A - 0x11D	4	BID Shares	Total bid shares (big-endian)	`0x0000012C` = 300
0x11E - 0x121	4	BID Price	Best bid price (big-endian)	`0x0016CFD8` = 1,495,000
0x122 - 0x129	8	Symbol	Stock ticker (ASCII, space-padded)	`"AAPL    "`

Important Notes:

Byte Order: Multi-byte integers are in big-endian format (network byte order)
Price Format: Prices are in fixed-point format (4 decimal places): 1,495,000 = $149.50
BBO Location: Due to nibble-write order reversal, BBO data appears at bytes 228-255 of the UDP payload instead of bytes 0-27
Symbol Padding: Symbol names are 8 bytes, space-padded (e.g., "AAPL    ")
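
To make the layout concrete, here is a minimal Python sketch that builds a payload with the field order and sizes listed above. The function name build_bbo_payload is hypothetical, and computing the spread as ask minus bid on the host is an assumption made for illustration; the FPGA formatter itself is not shown in this document.

```python
import struct

HEADER_BYTES = 14 + 20 + 8               # Ethernet + IPv4 + UDP headers
PAYLOAD_BYTES = 256
BBO_BYTES = 28
BBO_OFFSET = PAYLOAD_BYTES - BBO_BYTES   # 228 bytes of zero padding come first


def build_bbo_payload(symbol, bid_price, bid_shares, ask_price, ask_shares):
    """Build a 256-byte payload: 228 zero bytes followed by the 28-byte BBO record."""
    spread = ask_price - bid_price       # assumes ask >= bid (an uncrossed book)
    sym = symbol.encode("ascii").ljust(8, b" ")
    # Field order matches the table: spread, ask shares, ask price,
    # bid shares, bid price (all big-endian uint32), then the 8-byte symbol.
    record = struct.pack(
        ">IIIII8s", spread, ask_shares, ask_price, bid_shares, bid_price, sym
    )
    return bytes(BBO_OFFSET) + record


payload = build_bbo_payload("AAPL", 1_495_000, 300, 1_520_000, 300)
print(len(payload))                      # total payload length in bytes
print(hex(HEADER_BYTES + BBO_OFFSET))    # frame offset of the BBO record
print(payload[BBO_OFFSET:].hex(" "))     # the 28-byte record as hex
```

Prices use the 4-decimal fixed-point format described above, so 1_495_000 stands for $149.50.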

Example Packet (AAPL):

Hex dump (frame offsets 0x10E-0x129, bytes 228-255 of payload):
010e: 00 00 61 a8 00 00 01 2c 00 17 31 80 00 00 01 2c
011e: 00 16 cf d8 41 41 50 4c 20 20 20 20

Decoded:
- Symbol: "AAPL    "
- BID: $149.50 (1,495,000), 300 shares
- ASK: $152.00 (1,520,000), 300 shares
- Spread: $2.50 (25,000)

Python Parsing Example:

import socket
import struct

PAYLOAD_LEN = 256   # UDP payload size sent by the FPGA
BBO_OFFSET = 228    # BBO record occupies the last 28 bytes of the payload

def parse_bbo_packet(payload):
    # A SOCK_DGRAM socket delivers only the UDP payload; the OS has already
    # stripped the Ethernet (14), IP (20), and UDP (8) headers.
    # For a raw captured frame, pass frame[42:] instead.
    if len(payload) < PAYLOAD_LEN:
        raise ValueError(f"expected {PAYLOAD_LEN}-byte payload, got {len(payload)}")

    # BBO data is at payload bytes 228-255 (last 28 bytes)
    bbo_data = payload[BBO_OFFSET:PAYLOAD_LEN]

    # Unpack as big-endian (network byte order)
    spread, ask_shares, ask_price, bid_shares, bid_price = struct.unpack(
        '>IIIII', bbo_data[0:20]
    )
    symbol = bbo_data[20:28].decode('ascii').rstrip()

    # Convert fixed-point prices (4 decimal places) to dollars
    return {
        'symbol': symbol,
        'bid_price': bid_price / 10000.0,
        'bid_shares': bid_shares,
        'ask_price': ask_price / 10000.0,
        'ask_shares': ask_shares,
        'spread': spread / 10000.0
    }

# Receive UDP packets
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(('192.168.0.93', 5000))

while True:
    data, addr = sock.recvfrom(1024)
    bbo = parse_bbo_packet(data)
    print(f"[{bbo['symbol']}] Bid: ${bbo['bid_price']:.2f} ({bbo['bid_shares']}) | "
          f"Ask: ${bbo['ask_price']:.2f} ({bbo['ask_shares']}) | "
          f"Spread: ${bbo['spread']:.2f}")

C++ Parsing Example:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>   // ntohl
#include <cstddef>
#include <cstdint>
#include <cstring>

struct BBO {
    char symbol[9];     // 8 chars + null terminator
    uint32_t bid_price;
    uint32_t bid_shares;
    uint32_t ask_price;
    uint32_t ask_shares;
    uint32_t spread;
};

// Read a big-endian 32-bit value without an unaligned pointer cast
static uint32_t read_be32(const uint8_t* p) {
    uint32_t v;
    std::memcpy(&v, p, sizeof(v));
    return ntohl(v);
}

// `data` is the 256-byte UDP payload as returned by recvfrom().
// For a raw Ethernet frame, pass data + 42 and len - 42 (14 + 20 + 8 header bytes).
// Returns a zero-initialized BBO if the payload is too short.
BBO parse_bbo_packet(const uint8_t* data, size_t len) {
    BBO bbo = {};
    if (len < 256) return bbo;

    // BBO data at payload bytes 228-255
    const uint8_t* bbo_data = data + 228;

    // Parse big-endian integers
    bbo.spread = read_be32(bbo_data + 0);
    bbo.ask_shares = read_be32(bbo_data + 4);
    bbo.ask_price = read_be32(bbo_data + 8);
    bbo.bid_shares = read_be32(bbo_data + 12);
    bbo.bid_price = read_be32(bbo_data + 16);

    // Copy symbol (8 bytes)
    std::memcpy(bbo.symbol, bbo_data + 20, 8);
    bbo.symbol[8] = '\0';

    // Trim trailing spaces
    for (int i = 7; i >= 0; i--) {
        if (bbo.symbol[i] == ' ') bbo.symbol[i] = '\0';
        else break;
    }

    return bbo;
}

SPI Slave Interface (Project 19 Integration)

PY32F030 Status Display (spi_slave_core.vhd + spi_register_if.vhd + spi_slave.vhd):

External ARM Cortex-M0 microcontroller (PY32F030) reads FPGA status via SPI
6-register bank: 4 read-only status inputs + 2 read-write configuration outputs
SPI Mode 0 (CPOL=0, CPHA=0), up to 10 MHz tested
Modular architecture: Generic SPI core + application-specific register mapping
Production-grade timing: 2-cycle pipeline for register reads, proper CDC handling

Register Map:

Address	Name	Type	Description	FPGA Source	PY32 Use
0x00	ORDER_COUNT	R	Total orders processed	Order book	Activity indicator
0x01	BBO_COUNT	R	BBO updates generated	BBO tracker	Update frequency
0x02	LATENCY_P50	R	P50 latency (ns)	Latency monitor	Performance metric
0x03	STATUS	R	System status flags	System monitor	Health check
0x04	SYMBOL_EN	RW	Symbol enable mask (8 bits)	PY32 config	Symbol filtering
0x05	THRESHOLD	RW	BBO spread threshold	PY32 config	Risk parameter

SPI Protocol:

Transaction Format: [CMD_BYTE][ADDR_BYTE][DATA_32BIT]
Command Bytes: 0x01=READ, 0x02=WRITE
Data Format: 32-bit big-endian (MSB first)
Clock Domain Crossing: SPI_SCK → 100 MHz via 2-FF synchronizer
Validation: 10,000+ SPI transactions tested, zero errors
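
The transaction format above can be sketched in Python as a 6-byte frame. The helper names are hypothetical, and sending an all-zero data field with a READ command is an assumption; the document does not say what the master clocks out while the FPGA returns data.

```python
import struct

CMD_READ = 0x01
CMD_WRITE = 0x02

REG_SYMBOL_EN = 0x04   # RW: symbol enable mask
REG_THRESHOLD = 0x05   # RW: BBO spread threshold


def encode_spi_frame(cmd, addr, data=0):
    """Pack [CMD_BYTE][ADDR_BYTE][DATA_32BIT] with the data word big-endian."""
    return struct.pack(">BBI", cmd, addr, data)


def decode_spi_frame(frame):
    """Unpack a 6-byte frame into (cmd, addr, data)."""
    return struct.unpack(">BBI", frame)


frame = encode_spi_frame(CMD_WRITE, REG_THRESHOLD, 1000)
print(frame.hex(" "))
```

Using struct with an explicit ">" prefix keeps the byte order independent of the host CPU.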

Example PY32 Display Output:

Orders: 1 | BBO: 2 | Lat: 3 ns | Status: 0x00000004 | Symbol: 0xFF | Threshold: 1000

Architecture Benefits:

Resource Optimization: FPGA focuses on time-critical paths (312 ns ITCH-to-BBO, hardware-measured), not slow UI tasks
Dynamic Configuration: PY32 can update symbol filtering and risk thresholds via SPI writes
Independent Monitoring: External watchdog can reset FPGA if status registers freeze
Scalability: Register bank expandable to 256 registers (8-bit address space) for comprehensive monitoring

See Also: ../19-py32-fpga-status/README.md for complete SPI implementation details

Multi-Symbol Order Book Architecture

Multi-Symbol Wrapper (multi_symbol_order_book.vhd):

8 parallel order book instances (one per symbol)
Symbol demultiplexer routes ITCH messages to correct book
Round-robin BBO arbiter (40 µs per symbol @ 25 MHz)
Change detection: only outputs BBO when it changes
Supports: AAPL, TSLA, SPY, QQQ, GOOGL, MSFT, AMZN, NVDA
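
A behavioral sketch of the round-robin arbiter with change detection is given below. It is a software model only: the class name, the step method, and the tuple used to represent a BBO are illustrative assumptions, not the VHDL interface.

```python
SYMBOLS = ["AAPL", "TSLA", "SPY", "QQQ", "GOOGL", "MSFT", "AMZN", "NVDA"]


class BboArbiter:
    """Visit one symbol per step; emit its BBO only if it changed since last emitted."""

    def __init__(self, symbols):
        self.symbols = list(symbols)
        self.last_sent = {s: None for s in self.symbols}
        self.index = 0

    def step(self, current_bbo):
        """current_bbo maps symbol -> BBO value. Returns (symbol, bbo) or None."""
        sym = self.symbols[self.index]
        self.index = (self.index + 1) % len(self.symbols)   # advance round-robin pointer
        bbo = current_bbo.get(sym)
        if bbo is not None and bbo != self.last_sent[sym]:
            self.last_sent[sym] = bbo                        # remember what was emitted
            return sym, bbo
        return None


arbiter = BboArbiter(SYMBOLS)
books = {"AAPL": (1_495_000, 1_520_000)}        # (bid, ask) for one active symbol
emitted = [arbiter.step(books) for _ in range(16)]
print([e for e in emitted if e is not None])     # AAPL appears once across two rotations
```

If each slot takes the documented 40 µs, a full rotation over 8 symbols would take 8 × 40 µs = 320 µs, which bounds how long a changed BBO can wait for its turn.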

Order Storage (order_storage.vhd) - Per Symbol:

1024 concurrent orders per symbol
130-bit order entries (order_ref, price, shares, side, valid)
Simple Dual-Port BRAM (write port + read port, same clock)
2-cycle read latency pipeline
Order count tracking

Price Level Table (price_level_table.vhd) - Per Symbol:

256 price levels (128 bids + 128 asks)
82-bit level entries (price, total_shares, order_count, side, valid)
Read-First BRAM with 2-cycle read-modify-write pipeline
Address mapping: [0-127] = Bids (descending), [128-255] = Asks (ascending)
Level count tracking (active bid/ask levels)

BBO Tracker (bbo_tracker.vhd) - Per Symbol:

Finite state machine scans price level table
Finds highest bid (best bid) and lowest ask (best offer)
Calculates spread (ask - bid) in clocked FSM
Updates BBO on price level changes
2-cycle read latency handling

Order Book Manager (order_book_manager.vhd) - Per Symbol:

Top-level FSM coordinates all components
Handles ITCH message types: A (Add), E (Execute), X (Cancel), D (Delete), U (Replace)
Latency: ~12-17 clock cycles per message
Statistics tracking (order counts, level counts, lifetime operations)

ITCH Message Processing

Message Type	Action	Order Storage	Price Level	BBO Update
A (Add Order)	Add new order	Write order entry	Add shares to level	Trigger scan
E (Execute)	Reduce shares	Update shares	Remove shares from level	Trigger scan
X (Cancel)	Reduce shares	Update shares	Remove shares from level	Trigger scan
D (Delete)	Remove order	Mark invalid	Remove shares from level	Trigger scan
U (Replace)	Modify order	Update price/shares	Update both levels	Trigger scan
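
The table can be read as a small state machine over two stores. The Python model below is a simplified sketch of that behavior, assuming orders are removed from storage once their shares reach zero and that a Replace keeps the original side; neither detail is confirmed by the table, and all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class OrderBook:
    orders: dict = field(default_factory=dict)   # order_ref -> [side, price, shares]
    levels: dict = field(default_factory=dict)   # (side, price) -> total shares

    def _adjust_level(self, side, price, delta):
        total = self.levels.get((side, price), 0) + delta
        if total > 0:
            self.levels[(side, price)] = total
        else:
            self.levels.pop((side, price), None)  # level becomes inactive

    def add(self, ref, side, price, shares):      # 'A'
        self.orders[ref] = [side, price, shares]
        self._adjust_level(side, price, shares)

    def reduce(self, ref, shares):                # 'E' and 'X'
        side, price, remaining = self.orders[ref]
        shares = min(shares, remaining)
        self._adjust_level(side, price, -shares)
        if remaining == shares:
            del self.orders[ref]
        else:
            self.orders[ref][2] = remaining - shares

    def delete(self, ref):                        # 'D'
        side, price, remaining = self.orders.pop(ref)
        self._adjust_level(side, price, -remaining)

    def replace(self, ref, new_ref, new_price, new_shares):   # 'U'
        side = self.orders[ref][0]
        self.delete(ref)                          # remove shares from the old level
        self.add(new_ref, side, new_price, new_shares)  # add shares to the new level

    def bbo(self):
        bids = [p for (s, p) in self.levels if s == "B"]
        asks = [p for (s, p) in self.levels if s == "S"]
        return max(bids, default=None), min(asks, default=None)


book = OrderBook()
book.add(1, "B", 1_495_000, 300)
book.add(2, "S", 1_520_000, 300)
book.add(3, "B", 1_490_000, 100)
book.reduce(1, 100)    # Execute 100 shares of order 1
book.delete(1)         # Delete the remainder
print(book.bbo())      # best bid falls back to the next level
```

Every operation ends by recomputing the BBO from the level table, which mirrors the "Trigger scan" column.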

BRAM Inference Architecture

Critical Achievement: Both order_storage and price_level_table correctly infer Block RAM instead of Distributed RAM (LUTRAM).

Order Storage BRAM:

Simple Dual-Port pattern (write-only port A, read-only port B)
Separate valid_bits array for order counting (prevents read-modify-write on main BRAM)
ram_style attribute: "block" to force BRAM inference
Size: 1024 × 130 bits ≈ 16 KB (4 RAMB36 blocks)

Price Level Table BRAM:

Read-First Single-Port pattern (2-cycle read-modify-write pipeline)
Stage 1: Capture command, read old level from BRAM
Stage 2: Modify level, write back to BRAM
Explicit BRAM control signals (bram_do, bram_we, bram_addr, bram_di)
Size: 256 × 82 bits ≈ 2.5 KB (1 RAMB36 block)

Key Lesson: Read-modify-write patterns prevent BRAM inference. Separate read and write operations, or use separate storage for tracking data.
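
The size figures above follow from simple arithmetic, sketched here in Python. The block count is a capacity-only lower bound based on the 36 Kb size of a 7-series RAMB36; the count reported by synthesis can be higher because of port-width limits, so treat this as an estimate rather than a prediction of the utilization report.

```python
import math

RAMB36_BITS = 36 * 1024   # capacity of one 7-series RAMB36, parity bits included


def bram_estimate(depth, width_bits):
    """Return total bits, size in KiB, and a capacity-only RAMB36 lower bound."""
    bits = depth * width_bits
    return {
        "bits": bits,
        "kib": bits / 8 / 1024,
        "min_ramb36": math.ceil(bits / RAMB36_BITS),
    }


print("order_storage:    ", bram_estimate(1024, 130))
print("price_level_table:", bram_estimate(256, 82))
```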

Debug Infrastructure

UART BBO Formatter (uart_bbo_formatter.vhd):

Real-time BBO output: Bid:0xXXXXXXXX | Ask:0xXXXXXXXX | Spr:0xXXXXXXXX
Debug fields: Tr=0x (trigger), Rd=0x (ready), Lv=0x (level valid), LdP=0xXXXXXXXX (level data price), LdA=0xXX (level address)
Write tracking: WrA=0xXX (write address), WrP=0xXXXXXXXX (write price), WrS=0xX (write side)
Statistics: Order counts, level counts, update counts

Example Output (Multi-Symbol):

[BBO:NODATA  ]
[BBO:AAPL    ]Bid:0x0016E360 (0x00000064) | Ask:0x0016D99C (0x000000C8) | Spr:0x00001388
[BBO:TSLA    ]Bid:0x003EC7E0 (0x00000014) | Ask:0x00432380 (0x0000000A) | Spr:0x00045BA0
[BBO:SPY     ]Bid:0x0031522C (0x000001F4) | Ask:0x003148CC (0x000001F4) | Spr:0x00000960
[BBO:QQQ     ]Bid:0x0020A440 (0x00000320) | Ask:0x0020A184 (0x00000384) | Spr:0x00000258
[BBO:GOOGL   ]Bid:0x00BEBCE8 (0x00000001) | Ask:0x00CEC408 (0x0000000B) | Spr:0x00100720
[BBO:MSFT    ]Bid:0x001ADB00 (0x00000001) | Ask:0x001ADB00 (0x00000001) | Spr:0x00000000
[BBO:AMZN    ]Bid:0x011E7EF8 (0x00000014) | Ask:0x011E5854 (0x00000014) | Spr:0x00000000
[BBO:NVDA    ]Bid:0x00232BE8 (0x00000044) | Ask:0x00241948 (0x00000080) | Spr:0x0000ED60

Format: [BBO:SYMBOL]Bid:0xPRICE (0xSHARES) | Ask:0xPRICE (0xSHARES) | Spr:0xSPREAD
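
A host-side parser for this line format might look like the following Python sketch. The regular expression is derived only from the example output above, so field widths (8 hex digits, 8-character symbol) are assumptions; lines that carry extra debug fields or no data simply return None or are matched on their BBO portion.

```python
import re

_HEX8 = r"0x([0-9A-Fa-f]{8})"
BBO_LINE = re.compile(
    r"\[BBO:(.{8})\]"
    rf"Bid:{_HEX8} \({_HEX8}\) \| Ask:{_HEX8} \({_HEX8}\) \| Spr:{_HEX8}"
)


def parse_uart_bbo(line):
    """Parse one UART BBO line into a dict of raw integer fields, or None."""
    m = BBO_LINE.search(line)
    if m is None:
        return None                      # e.g. "[BBO:NODATA  ]"
    symbol = m.group(1).rstrip()
    bid, bid_sh, ask, ask_sh, spr = (int(g, 16) for g in m.groups()[1:])
    return {
        "symbol": symbol,
        "bid_price": bid,                # fixed-point, 4 decimal places
        "bid_shares": bid_sh,
        "ask_price": ask,
        "ask_shares": ask_sh,
        "spread": spr,
    }


line = "[BBO:AAPL    ]Bid:0x0016E360 (0x00000064) | Ask:0x0016D99C (0x000000C8) | Spr:0x00001388"
print(parse_uart_bbo(line))
```

Keeping prices as integers avoids floating-point rounding; divide by 10,000 only when displaying dollars.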

Architecture

Module Hierarchy

mii_eth_top (top-level)
├── ITCH Parser Pipeline (from Project 7) - 25 MHz domain
│   ├── mii_rx
│   ├── mac_parser
│   ├── ip_parser
│   ├── udp_parser
│   ├── itch_parser
│   └── symbol_filter (filters to 8 tracked symbols)
├── Multi-Symbol Order Book System - 25 MHz domain
│   ├── multi_symbol_order_book
│   │   ├── Symbol Demultiplexer (routes messages to correct book)
│   │   ├── order_book_manager[0] - AAPL
│   │   │   ├── order_storage (4 RAMB36)
│   │   │   ├── price_level_table (1 RAMB36)
│   │   │   └── bbo_tracker
│   │   ├── order_book_manager[1] - TSLA
│   │   ├── order_book_manager[2] - SPY
│   │   ├── order_book_manager[3] - QQQ
│   │   ├── order_book_manager[4] - GOOGL
│   │   ├── order_book_manager[5] - MSFT
│   │   ├── order_book_manager[6] - AMZN
│   │   ├── order_book_manager[7] - NVDA
│   │   └── BBO Arbiter (round-robin with change detection)
│   └── CDC Synchronizer (25 MHz → 100 MHz)
├── UDP TX System - 100 MHz domain (NEW)
│   ├── bbo_udp_formatter (formats BBO as 256-byte UDP payload)
│   └── eth_udp_send_wrapper (SystemVerilog wrapper)
│       └── eth_udp_send (from fpga-ethernet-udp project)
│           └── MII TX (25 MHz domain)
├── UART Formatter - 100 MHz domain (DEBUG ONLY)
│   └── uart_bbo_formatter (includes symbol name)
└── UART TX

Data Flow

═══════════════════════════════════════════════════════════
ITCH Message Arrival (from Project 7)
═══════════════════════════════════════════════════════════
ITCH Parser (25 MHz)
    ↓ (parsed fields)
ITCH Message Encoder
    ↓ (324-bit serialized)
Async FIFO (Gray Code CDC)
    ↓
═══════════════════════════════════════════════════════════
Order Book Processing (100 MHz)
═══════════════════════════════════════════════════════════
ITCH Message Decoder
    ↓ (decoded fields: type, order_ref, price, shares, etc.)
Order Book Manager FSM
    ├─→ Order Storage (BRAM write/read)
    ├─→ Price Level Table (BRAM read-modify-write)
    └─→ BBO Tracker (scan price levels)
        ↓
BBO Output (bid_price, ask_price, spread, symbol, valid)
    ├───────────────────┬─────────────────────┐
    ↓                   ↓                     ↓
═══════════════════ ═══════════════════ ═══════════════════
UDP TX PATH (NEW)   UART DEBUG PATH     CDC to 100 MHz
═══════════════════ ═══════════════════ ═══════════════════
BBO UDP Formatter   UART BBO Formatter
    ↓                   ↓
256-byte Payload    ASCII output
    ↓                   ↓
eth_udp_send        UART TX
    ↓               (115200 baud)
MII TX (25 MHz)     (DEBUG ONLY)
    ↓
Ethernet PHY
    ↓
UDP Packet
192.168.0.212:5000
→ 192.168.0.93:5000
═══════════════════════════════════════════════════════════

Order Book Manager FSM

IDLE
  ↓ (itch_valid = '1')
LOOKUP_ORDER (for E/X/D/U - read existing order)
  ↓
ADD_ORDER / UPDATE_ORDER / DELETE_ORDER
  ↓ (write to order_storage)
WAIT_PRICE_CMD (2-cycle latency)
  ↓
UPDATE_PRICE_ADD / UPDATE_PRICE_REMOVE
  ↓ (read-modify-write price_level_table)
WAIT_PRICE_CMD (2-cycle latency)
  ↓
UPDATE_BBO (trigger bbo_tracker scan)
  ↓
WAIT_BBO (wait for scan complete)
  ↓
DONE → IDLE

BBO Tracker FSM

IDLE
  ↓ (update_trigger = '1')
SCAN_BIDS
  ├─→ SCAN_BIDS_WAIT1 (read latency cycle 1)
  ├─→ SCAN_BIDS_WAIT2 (read latency cycle 2)
  └─→ Check level_valid, level_data.valid, level_data.side
      ↓ (if valid bid found)
      Update best_bid_price_reg
  ↓ (scan_addr > 1)
  Continue scanning (decrement scan_addr)
  ↓ (scan_addr = 1, all bids scanned)
SCAN_ASKS
  ├─→ SCAN_ASKS_WAIT1 (read latency cycle 1)
  ├─→ SCAN_ASKS_WAIT2 (read latency cycle 2)
  └─→ Check level_valid, level_data.valid, level_data.side
      ↓ (if valid ask found)
      Update best_ask_price_reg
  ↓ (scan_addr = MAX_BID_LEVELS + MAX_ASK_LEVELS, all asks scanned)
COMPLETE
  ↓ (output BBO, assert bbo_update)
IDLE

Implementation Details

BRAM Inference Fixes

Problem: Initial implementation inferred LUTRAM (Distributed RAM) instead of Block RAM, causing:

Resource waste (LUTRAM uses logic resources)
Potential timing issues
Incorrect bid price values (read pipeline timing)

Root Causes Identified:

Read-Modify-Write Pattern (price_level_table.vhd):
- Reading from BRAM signal in write process prevented BRAM inference
- Solution: 2-stage pipeline (Stage 1: read, Stage 2: modify+write)
- Explicit BRAM control signals following Xilinx Read-First template
Read in Write Process (order_storage.vhd):
- Reading prev_valid from BRAM in write process created read-modify-write
- Solution: Separate valid_bits array for order counting
- Write process is now write-only (matches Simple Dual-Port template)
Missing ram_style Attribute:
- Added attribute ram_style : string; attribute ram_style of bram : signal is "block";
- Forces BRAM inference when code pattern matches template

Xilinx Templates Used:

simple_dual_one_clock.vhd - For order_storage (Simple Dual-Port)
rams_sp_rf.vhd - For price_level_table (Read-First Single-Port)

Address Mapping

Price to Address Conversion (price_to_addr function):

-- Nominal layout: bids [0-127] (descending price order), asks [128-255] (ascending price order)
-- A +1 offset avoids address 0 (historical debugging), so bids map to [1-128]
-- and asks map to [129-255]

if side = '0' then  -- Buy
    addr := resize(price_bits + 1, PRICE_ADDR_WIDTH);  -- [1-128]
else  -- Sell
    addr := resize(price_bits + 128 + 1, PRICE_ADDR_WIDTH);  -- [129-255]
end if;

BBO Scan Addresses:

Bids: Start at MAX_BID_LEVELS (128), scan down to 1
Asks: Start at MAX_BID_LEVELS + 1 (129), scan up to 255
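
The address mapping and scan order can be modeled in a few lines of Python. This is a sketch under assumptions: price_index stands in for the price_bits value (its derivation from the price is not shown here), and picking the best price by comparison is an illustrative choice; the hardware may rely on address ordering instead. The empty-side values 0x00000000 and 0xFFFFFFFF follow the register conventions used elsewhere in this document.

```python
MAX_BID_LEVELS = 128
MAX_ASK_LEVELS = 128
NO_BID = 0x00000000
NO_ASK = 0xFFFFFFFF


def price_to_addr(price_index, side):
    """side 0 = buy, 1 = sell; applies the +1 offset that skips address 0."""
    if side == 0:
        return price_index + 1
    return price_index + MAX_BID_LEVELS + 1


def scan_bbo(levels):
    """levels maps addr -> (price, side, valid). Returns (best_bid, best_ask)."""
    best_bid, best_ask = NO_BID, NO_ASK
    # Bids: start at MAX_BID_LEVELS (128) and scan down to 1
    for addr in range(MAX_BID_LEVELS, 0, -1):
        price, side, valid = levels.get(addr, (0, 0, False))
        if valid and side == 0 and price > best_bid:
            best_bid = price
    # Asks: start at 129 and scan up to 255
    for addr in range(MAX_BID_LEVELS + 1, MAX_BID_LEVELS + MAX_ASK_LEVELS):
        price, side, valid = levels.get(addr, (0, 1, False))
        if valid and side == 1 and price < best_ask:
            best_ask = price
    return best_bid, best_ask


levels = {
    price_to_addr(5, 0): (1_495_000, 0, True),
    price_to_addr(9, 0): (1_490_000, 0, True),
    price_to_addr(3, 1): (1_520_000, 1, True),
}
print(scan_bbo(levels))
```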

Read Pipeline Latency

2-Cycle Latency Pattern:

Cycle 0: Assert rd_en / level_req, set address
Cycle 1: BRAM outputs data (registered)
Cycle 2: Data available on rd_data / level_data

Handling in FSM:

WAIT_PRICE_CMD state: wait_counter <= 2 (accounts for 2-cycle latency)
SCAN_BIDS_WAIT1 / SCAN_BIDS_WAIT2: Two wait states for read latency
SCAN_ASKS_WAIT1 / SCAN_ASKS_WAIT2: Two wait states for read latency
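
The 2-cycle pattern can be checked with a small cycle-level model. The Python class below is a behavioral sketch only; its name and interface are hypothetical, and it models just the delay between issuing a read and seeing valid data.

```python
class PipelinedRam:
    """Read issued in cycle N produces valid data in cycle N+2."""

    def __init__(self, contents, latency=2):
        self.mem = dict(contents)
        self.pipe = [None] * latency     # in-flight read addresses, newest first

    def tick(self, rd_addr=None):
        """Advance one clock. Returns (valid, data) for the read leaving the pipeline."""
        leaving = self.pipe.pop()        # oldest in-flight request completes
        self.pipe.insert(0, rd_addr)     # new request (or idle) enters
        if leaving is None:
            return (False, None)
        return (True, self.mem.get(leaving, 0))


ram = PipelinedRam({5: 0x0016CFD8})
print(ram.tick(rd_addr=5))   # cycle 0: request issued, no data yet
print(ram.tick())            # cycle 1: BRAM output register loading
print(ram.tick())            # cycle 2: data valid
```

The two idle ticks correspond to the WAIT1 and WAIT2 states in the scanner FSM.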

Debug Journey: Bid Price Issue

Symptom: Bid prices consistently 0x00000000 while ask prices worked correctly.

Debug Process:

Added debug outputs: SA (scan address), BdP (bid price), BdV (bid valid), St (state)
Discovered: BBO tracker stuck in IDLE, scan_addr not initialized
Fixed: scan_addr reset initialization
Discovered: bbo_trigger never set, bbo_ready always high
Fixed: bbo_trigger timing in UPDATE_BBO / WAIT_BBO states
Discovered: LdP=0x00000000 even when Lv=1 (level valid but price zero)
Fixed: Read pipeline timing in price_level_table (rd_valid_pending signal)
Discovered: BRAM inferring LUTRAM instead of BRAM
Fixed: Refactored to Xilinx BRAM templates, added ram_style attribute
Result: Bid prices now show correct values

Key Debug Signals Added:

debug_level_valid - Level valid signal from price level table
debug_level_data_price - Raw price read from level_data
debug_level_addr - Address being read (captured when level_req asserted)
debug_wr_addr, debug_wr_price, debug_wr_side, debug_wr_valid - Write operation tracking

Building the Design

Prerequisites

Vivado 2025.1 (or compatible version)
Windows PC (universal build.tcl works on Windows)
Git for version control
Project 7 (ITCH parser) as dependency

Build Commands

Use the universal build script from repository root:

REM Full build (synthesis + implementation + bitstream)
REM Auto-increments build version
build 08-order-book

REM Program FPGA
prog 08-order-book

Build time: ~15-20 minutes on typical desktop

Build Version: Displayed in build log:

==========================================
BUILD VERSION: X
==========================================

Testing

Hardware Setup

Connect Arty A7 to PC via USB (JTAG + UART)
Connect Ethernet cable from PC/Network switch to Arty A7
Configure Ethernet adapter:
- IP: 192.168.1.10
- Subnet: 255.255.255.0
- No gateway needed

Open serial terminal (115200 baud, 8N1):

python -m serial.tools.miniterm COM3 115200

Test Procedure

1. Add Order Test

cd 07-itch-parser-v4\test
python send_itch_packets.py --target 192.168.1.10 --port 12345 --test add_order

Expected UART output:

[BBO] Bid:0x0016E360 | Ask:0xFFFFFFFF | Spr:0xFFFFFFFF (BW=00 AW=00) A0W=00 P=00000000 S=00000000

Shows: Bid price $150.00 (0x0016E360), no ask yet (0xFFFFFFFF = invalid)

2. Complete Order Lifecycle

python send_itch_packets.py --target 192.168.1.10 --port 12345 --test lifecycle

Sends sequence:

Add Order (Buy) → Bid price appears
Add Order (Sell) → Ask price appears, spread calculated
Execute → Shares reduced, BBO updated
Cancel → Shares reduced, BBO updated
Delete → Order removed, BBO updated

Verification:

BBO prices update correctly
Spread calculated: Spr = Ask - Bid
Order counts increment/decrement
Level counts track active price levels

3. Multiple Price Levels

python send_itch_packets.py --target 192.168.1.10 --port 12345 --test multi_level

Expected: BBO shows best bid (highest) and best ask (lowest), even with multiple orders at different prices

Debug Output Interpretation

BBO Format:

[BBO] Bid:0x0016E360 | Ask:0x0016D99C | Spr:0x00001388
      ^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^
      Best bid price   Best ask price   Spread (ask - bid)

Debug Fields:

Tr=0x - BBO trigger (1 = scan triggered)
Rd=0x - BBO ready (1 = scan complete, 0 = scanning)
Lv=0x - Level valid (1 = level data available)
LdP=0xXXXXXXXX - Level data price (price read from level)
LdA=0xXX - Level data address (address being scanned)
WrA=0xXX - Write address (when write occurs)
WrP=0xXXXXXXXX - Write price (price being written)
WrS=0xX - Write side (0=bid, 1=ask)

Statistics Fields:

BLv=XX - Bid level count (active bid price levels)
ALv=XX - Ask level count (active ask price levels)
BOrd=XXXX - Bid order count (active buy orders)
AOrd=XXXX - Ask order count (active sell orders)
Upd=XXXX - Update count (total BBO updates)

Troubleshooting

Symptom	Possible Cause	Solution
Bid prices always 0x00000000	BRAM inferring LUTRAM	Check synthesis report, verify BRAM templates
BBO not updating	`bbo_trigger` not set	Check `UPDATE_BBO` / `WAIT_BBO` states
Level data always zero	Read pipeline timing	Verify 2-cycle latency handling
Multiple driver errors	Signal driven from multiple processes	Consolidate signal assignments
Buffer overflow in UART	Debug fields exceed buffer size	Increase `byte_array` size in formatter

Performance Metrics

Test Data

The order book has been validated using real-world NASDAQ market data:

Source: 12302019.NASDAQ_ITCH50 (December 30, 2019 trading day)

Total Dataset: ~250 million ITCH 5.0 messages (8 GB binary file)
MySQL Database: 50 million records imported (first 3 hours of trading)
Test Dataset: 80,000 messages (10,000 per symbol)
Symbols: AAPL, TSLA, SPY, QQQ, GOOGL, MSFT, AMZN, NVDA
Message Mix: 98.2% Add Orders (A), 1.8% Trades (P)
Test Rate: 600+ messages/second sustained

All performance metrics below are based on processing this real-world trading day data.

Detailed Information: See ../docs/database.md for complete extraction process, message distribution, and data quality validation.

Latency

Order processing: ~12-17 clock cycles per message (@ 100 MHz = 120-170 ns)
BBO update: ~260 clock cycles (128 bids + 128 asks × 2 cycles/level) = 2.6 μs
ITCH parse → UDP TX: 312 ns (4-point hardware-measured on AX7203)
- ITCH parsing: 288 ns (36 cycles @ 125 MHz)
- FIFO read → UDP TX start: 24 ns (3 cycles @ 125 MHz)
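
The cycle-to-time conversions above can be reproduced with a one-line helper. This Python sketch only restates the arithmetic in the list; the function name is hypothetical.

```python
def cycles_to_ns(cycles, clock_hz):
    """Convert a clock-cycle count to nanoseconds at the given clock frequency."""
    return cycles * 1_000_000_000 / clock_hz


itch_parse_ns = cycles_to_ns(36, 125_000_000)     # ITCH parsing
fifo_to_tx_ns = cycles_to_ns(3, 125_000_000)      # FIFO read to UDP TX start
print(itch_parse_ns, fifo_to_tx_ns, itch_parse_ns + fifo_to_tx_ns)

# Order processing range at 100 MHz
print(cycles_to_ns(12, 100_000_000), cycles_to_ns(17, 100_000_000))
```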

Resource Utilization

Actual for Artix-7 XC7A100T (Multi-Symbol Implementation):

Resource	Used	Available	Utilization
Slice LUTs	~30,000	63,400	~47%
Slice Registers	~16,000	126,800	~13%
RAMB36 Tiles	32	135	23.7%
RAMB18 Tiles	2	270	0.74%
DSP Slices	0	240	0%

BRAM Breakdown (Multi-Symbol):

order_storage × 8 symbols: 32 RAMB36 blocks (1024 × 130 bits each)
price_level_table × 8 symbols: Included in above (256 × 82 bits each)
async_fifo: 2 RAMB18 blocks (512 × 324 bits)

Resource Scalability:

Single symbol: 4 RAMB36 per order book
8 symbols: 32 RAMB36 (24% utilization)
Headroom: 76% BRAM capacity remaining for additional features

Timing

System clock: 100 MHz (10 ns period)
Worst Negative Slack (WNS): > 0 ns (timing met)
Critical path: BRAM read paths, BBO scanner FSM

Key Design Decisions

BRAM Inference Strategy

Requirement: Efficient on-chip memory for 1024 orders and 256 price levels.

Implementation: Xilinx BRAM templates for guaranteed Block RAM inference:

Simple Dual-Port (order_storage): Separate write and read processes eliminate read-modify-write conflicts
Read-First Single-Port (price_level_table): 2-stage pipeline (read → modify → write)
ram_style attribute: Explicit directive forces BRAM when code matches template

Rationale: Synthesis tools use pattern-matching for memory inference. Template compliance guarantees Block RAM instead of distributed LUT RAM, saving logic resources and improving timing.

Architectural Separation for Complex Operations

Challenge: Order counting requires reading valid status during write operations, which creates a read-modify-write pattern that prevents BRAM inference.

Solution: Separate valid_bits array tracks order validity independently from main BRAM storage.

Trade-off: Additional logic resources for tracking array, but enables proper BRAM inference for primary storage. Net resource savings and better timing closure.

Debug Instrumentation Philosophy

Approach: Comprehensive UART output of internal state:

Scan addresses, read data, write operations
FSM states, trigger signals, ready flags
Performance counters (order counts, level counts, update counts)

Rationale: Hardware debugging without visibility is speculation. Strategic instrumentation enabled:

Systematic root cause diagnosis (BRAM inference issue identified in 2 build cycles)
Performance characterization (actual vs expected latency)
Production validation (BBO correctness verification)

Cost: ~500 LUTs for debug formatter. Benefit: 10x faster debug cycles.

Pipeline Latency Handling

BRAM Characteristic: 1-2 cycle read latency (registered output).

FSM Design: Explicit wait states in all read paths:

wait_counter tracks pipeline stages
Separate WAIT states for each read operation
BBO scanner includes WAIT1/WAIT2 states for 2-cycle latency

Validation: Simulation waveforms verify data availability timing before hardware deployment.

Production Trading System Applicability

Architecture Patterns:

BRAM-Based Storage: On-chip memory architecture scales to multi-symbol order books
Multi-Stage FSMs: Deterministic latency pipelines essential for HFT systems
Memory Inference Control: Template-based design guarantees resource utilization
Systematic Debug: Instrumentation enables rapid production issue diagnosis
Latency Budgeting: Sub-microsecond processing meets HFT requirements

Real-World Relevance:

Core Infrastructure: Order books are fundamental to exchange matching engines, market makers, HFT systems
Deterministic Performance: Fixed-cycle FSMs eliminate software non-determinism (no GC pauses, cache misses, context switches)
Scalability Path: BRAM architecture extends to multiple symbols, deeper books, additional order types
Production Debugging: Instrumentation techniques apply directly to production FPGA trading systems where observability is limited

Files Structure

Core Modules

order_book_manager.vhd - Top-level FSM coordinating all components
order_storage.vhd - BRAM-based order storage (1024 orders)
price_level_table.vhd - BRAM-based price level aggregation (256 levels)
bbo_tracker.vhd - FSM scanner for Best Bid/Offer tracking
order_book_pkg.vhd - Constants, types, helper functions

Integration

mii_eth_top.vhd - Top-level integration with ITCH parser (from Project 7)
uart_bbo_formatter.vhd - UART output formatter for BBO and debug data
itch_msg_decoder.vhd - ITCH message decoder (from Project 7)

Supporting Files

async_fifo.vhd - Clock domain crossing FIFO (from Project 7)
itch_msg_pkg.vhd - ITCH message encoding/decoding (from Project 7)
All ITCH parser modules (from Project 7)

Future Enhancements

Phase 2: Multi-Symbol Support - COMPLETE

Symbol filtering integration (8 symbols: AAPL, TSLA, SPY, QQQ, GOOGL, MSFT, AMZN, NVDA)
Per-symbol order books (8 parallel instances)
Symbol-based BBO tracking with round-robin arbiter
Spread calculation for risk management

Phase 3: Order Matching - Next Steps

Price-time priority matching
Trade execution logic
Fill reporting

Phase 4: Market Data Output

Level 2 market data (full depth)
Order book snapshots
Real-time updates via Ethernet

Phase 5: C++ Order Gateway (Project 9) - COMPLETE (See 09-order-gateway-cpp/)

UART BBO parser (C++)
Multi-protocol output: TCP JSON, MQTT, Kafka
Real-time market data distribution
Integration with FPGA order book
Live chart display in Java desktop application

Project Status

Status: Functional - Integration Testing in Progress

Created: November 2025

Last Updated: November 2025 - Multi-Symbol Order Book with Spread Calculation Complete

Known Issues

CRITICAL BUG: ITCH Message Routing Incomplete

Problem: The order book only processes Add Order ('A') messages correctly. Execute ('E'), Cancel ('X'), Delete ('D'), and Replace ('U') messages reach the order book but have no effect because critical input signals are hardcoded to zero.

Root Cause: In multi_symbol_order_book.vhd, the following signals are hardcoded:

itch_exec_shares    => (others => '0'),  -- Should carry shares for 'E' messages
itch_cancel_shares  => (others => '0'),  -- Should carry shares for 'X' messages
itch_new_order_ref  => (others => '0'),  -- Should carry new_order_ref for 'U' messages
itch_new_price      => (others => '0'),  -- Should carry new_price for 'U' messages
itch_new_shares     => (others => '0'),  -- Should carry new_shares for 'U' messages

Impact:

Orders never removed from book (Execute, Cancel, Delete have no effect)
Order Replace messages do not update prices
BBO accumulates stale prices over time, leading to crossed markets (bid > ask)
Order book appears functional initially but degrades with sustained message flow
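
The impact can be illustrated with a toy software model in which removal messages are either honored or silently ignored. This Python sketch is not the FPGA logic; the message tuples and function names are hypothetical, and only Add and Delete are modeled.

```python
def best_prices(messages, honor_removals=True):
    """Apply ('A', ref, side, price) and ('D', ref) messages; return (best_bid, best_ask)."""
    orders = {}
    for msg in messages:
        if msg[0] == "A":
            _, ref, side, price = msg
            orders[ref] = (side, price)
        elif msg[0] == "D" and honor_removals:
            orders.pop(msg[1], None)
        # With honor_removals=False, 'D' has no effect (models the bug)
    bids = [p for s, p in orders.values() if s == "B"]
    asks = [p for s, p in orders.values() if s == "S"]
    return max(bids, default=None), min(asks, default=None)


def is_crossed(bid, ask):
    return bid is not None and ask is not None and bid > ask


feed = [
    ("A", 1, "B", 1_500_000),   # buy at $150.00
    ("D", 1),                   # that order leaves the market
    ("A", 2, "S", 1_497_500),   # later sell at $149.75
]
print(is_crossed(*best_prices(feed, honor_removals=True)))
print(is_crossed(*best_prices(feed, honor_removals=False)))
```

When the Delete is ignored, the stale $150.00 bid stays in the book and ends up above the new ask, which is the crossed-market symptom described above.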

Workaround: None. This project is provided as a reference implementation.

Fix Available: See Project 23 (23-order-book/) which contains the complete fix:

CDC FIFO expanded from 320 to 448 bits to carry all message fields
trading_top.vhd updated to route all order-affecting messages through the FIFO
multi_symbol_order_book.vhd connects exec_shares/cancel_shares properly
order_book_manager.vhd includes full Order Replace ('U') handler

Recent Fixes

Multi-Symbol Support (November 2025):

Implemented multi_symbol_order_book.vhd wrapper with 8 parallel order books
Symbol demultiplexer routes ITCH messages to correct book based on symbol match
Round-robin BBO arbiter cycles through 8 symbols with change detection
Per-symbol BBO tracking maintains independent state for each symbol
Resource usage: 32 of the XC7A100T's 135 RAMB36 tiles (23.7% utilization) - well within capacity
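The round-robin arbiter with change detection can be sketched as a behavioral model. The class and method names below are hypothetical, and visiting one symbol per step is an assumption about how the hardware arbiter advances; the point is only to show that a symbol is reported when its BBO differs from the last value sent for it.

```python
class RoundRobinBboArbiter:
    """Visits one symbol per step and reports it only if its BBO changed."""

    def __init__(self, num_symbols=8):
        self.num_symbols = num_symbols
        self.pointer = 0
        self.last_sent = [None] * num_symbols  # last BBO sent per symbol

    def step(self, current_bbos):
        idx = self.pointer
        self.pointer = (self.pointer + 1) % self.num_symbols
        if current_bbos[idx] != self.last_sent[idx]:
            self.last_sent[idx] = current_bbos[idx]
            return idx, current_bbos[idx]
        return None  # nothing new for this symbol


def run_round(arbiter, bbos):
    """Advance the arbiter once per symbol and collect what it sends."""
    sent = []
    for _ in range(arbiter.num_symbols):
        update = arbiter.step(bbos)
        if update is not None:
            sent.append(update)
    return sent


arbiter = RoundRobinBboArbiter()
bbos = [None] * 8
bbos[2] = (1_500_000, 1_501_000)        # symbol 2 gets its first BBO
first_round = run_round(arbiter, bbos)  # symbol 2 is reported
second_round = run_round(arbiter, bbos) # unchanged, nothing is reported
bbos[5] = (2_000_000, 2_000_500)
third_round = run_round(arbiter, bbos)  # only symbol 5 is reported
```

Keeping a per-symbol copy of the last transmitted value means an idle symbol costs one arbiter slot but no output bandwidth.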

Spread Calculation & BBO Persistence Fix (November 2025):

Fixed the bbo_tracker.vhd spread calculation by moving it into the clocked FSM (it was previously a combinational process)
Added a best_spread_reg register that is computed in the COMPUTE_SPREAD state
Added the missing spread output port to multi_symbol_order_book.vhd
Connected spread through complete data path: bbo_tracker → order_book_manager → multi_symbol_order_book → mii_eth_top → UART
CRITICAL FIX: Removed data-clearing logic in bbo_tracker.vhd that was wiping BBO registers when one side was empty
Root Cause: COMPUTE_SPREAD state cleared all price registers when best_bid_found='0' OR best_ask_found='0'
Impact: Orders were being added correctly but BBO scan would clear bid data when no asks existed (and vice versa)
Solution: Removed clearing logic from SCAN_BIDS/SCAN_ASKS completion (lines 139-143, 187-191)
Result: Price registers now persist between scans, only updated when valid data is found
CRITICAL FIX: Changed COMPUTE_SPREAD validation from scan flags to register contents
Root Cause: best_bid_found and best_ask_found flags only tracked current scan, not accumulated data
Impact: Spread was always 0 because flags were cleared at start of each scan, even if registers had valid data
Solution: Check actual register values (best_bid_price_reg /= 0x00000000 and best_ask_price_reg /= 0xFFFFFFFF) instead of scan flags
Result: Spread now correctly calculates ask_price - bid_price when both sides exist in register
Spread now correctly calculates ask_price - bid_price for all symbols
BBO maintains both bid and ask sides simultaneously (no longer clears one when updating the other)
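The register-based validation can be illustrated with a short model. This is a sketch under stated assumptions: the sentinel values come from the fix description above, while the 32-bit mask and the choice to return 0 when a side is missing are assumptions about the hardware, not confirmed behavior.

```python
BID_EMPTY = 0x00000000  # bid register value meaning "no bid stored"
ASK_EMPTY = 0xFFFFFFFF  # ask register value meaning "no ask stored"


def compute_spread(best_bid_price_reg, best_ask_price_reg):
    """Validate on register contents rather than per-scan found flags."""
    bid_valid = best_bid_price_reg != BID_EMPTY
    ask_valid = best_ask_price_reg != ASK_EMPTY
    if bid_valid and ask_valid:
        # Assumed 32-bit datapath, so the result is masked to 32 bits.
        return (best_ask_price_reg - best_bid_price_reg) & 0xFFFFFFFF
    return 0  # assumed output when either side is missing


both_sides = compute_spread(1_500_000, 1_501_000)
bid_only = compute_spread(1_500_000, ASK_EMPTY)
ask_only = compute_spread(BID_EMPTY, 1_501_000)
```

Because the registers persist between scans, this check stays true after a scan that found nothing new, whereas the scan flags would already have been cleared.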

BRAM Inference Fixes (November 2025):

Fixed order_storage.vhd LUTRAM inference by separating read and write processes (Simple Dual-Port pattern)
Fixed price_level_table.vhd LUTRAM inference by implementing 2-stage read-modify-write pipeline (Read-First Single-Port pattern)
Added ram_style attribute to force BRAM inference after template refactoring
Resolved the bid price issue (bid consistently read back as 0x00000000) by making both memories comply with the BRAM inference templates

Debug Infrastructure (November 2025):

Added comprehensive UART debug outputs: scan addresses, read data, write operations, state machine status
Fixed BBO tracker initialization and trigger timing
Fixed read pipeline latency handling (2-cycle BRAM latency)

Architecture Improvements (November 2025):

Refactored order_storage to use a separate valid_bits array for order counting (avoids a read-modify-write on the main BRAM)
Refactored price_level_table to use explicit BRAM control signals following the Xilinx template
Updated order_book_manager to account for the 2-cycle price level table latency
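The 2-cycle read latency and the read-modify-write sequence it forces on the caller can be modeled cycle by cycle. This is a minimal sketch: PipelinedBram and add_shares are hypothetical names, and modeling the latency as an internal read register followed by an output register is an assumption about where the two cycles come from.

```python
class PipelinedBram:
    """Memory whose read data appears two clock edges after the address."""

    def __init__(self, depth):
        self.mem = [0] * depth
        self.stage1 = 0  # internal read register (assumed)
        self.dout = 0    # output register (assumed)

    def clock(self, addr, we=False, din=0):
        # Both registers update on the same edge from pre-edge values.
        self.dout = self.stage1
        self.stage1 = self.mem[addr]  # read-first: old contents are read
        if we:
            self.mem[addr] = din
        return self.dout


def add_shares(ram, addr, delta):
    """Read-modify-write one price level, waiting out the read latency."""
    ram.clock(addr)                            # cycle 1: present address
    old = ram.clock(addr)                      # cycle 2: data is valid
    ram.clock(addr, we=True, din=old + delta)  # cycle 3: write back
    return old + delta


ram = PipelinedBram(256)
ram.mem[5] = 100
after_one_edge = ram.clock(5)   # too early: data not yet at the output
after_two_edges = ram.clock(5)  # contents of address 5 are now visible
new_total = add_shares(ram, 5, 50)
```

Sampling the output one cycle early returns whatever was read previously, which is the kind of error the latency fixes address. Back-to-back updates to the same address would additionally need forwarding or a stall, which this sketch does not model.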

BBO UART Format Enhancements (November 2025):

Added symbol name to BBO output: [BBO:AAPL ] instead of generic [BBO]
Added bid_shares and ask_shares to output format: Bid:0xPRICE (0xSHARES)
Added spread to output: Spr:0xSPREAD
Added [BBO:NODATA ] status message when order book is empty (vs repeating stale prices)
Fixed symbol byte order (MSB-first extraction from FILTER_SYMBOL_LIST constant)
Disabled heartbeat trigger to prevent false activity in C++ gateway (Project 9 integration)
BBO now only sent when prices, shares, or valid status actually change
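As an illustration of the enhanced output, the following sketch assembles a line from the fragments listed above. Only the [BBO:AAPL ], Bid:0xPRICE (0xSHARES), Spr:0xSPREAD, and [BBO:NODATA ] fragments come from this document; the Ask field, the field order, the single-space separators, and the 8-digit hex width are assumptions, and format_bbo_line is a hypothetical helper rather than part of the project.

```python
def format_bbo_line(symbol, bbo):
    """Build one debug line; bbo is None when the book is empty."""
    if bbo is None:
        return "[BBO:NODATA ]"
    bid_price, bid_shares, ask_price, ask_shares, spread = bbo
    return (
        f"[BBO:{symbol} ] "
        f"Bid:0x{bid_price:08X} (0x{bid_shares:08X}) "
        f"Ask:0x{ask_price:08X} (0x{ask_shares:08X}) "  # assumed by symmetry
        f"Spr:0x{spread:08X}"
    )


line = format_bbo_line("AAPL", (1_500_000, 100, 1_501_000, 200, 1_000))
empty = format_bbo_line("AAPL", None)
print(line)
print(empty)
```

A host-side parser should be written against the actual UART capture rather than this layout, since the separators and widths here are guesses.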

Credits and Acknowledgments

Third-Party Components

eth_udp_send SystemVerilog Module:

Source: fpga-ethernet-udp by Adam Christiansen
License: MIT License
Usage: Core UDP transmission module providing UDP/IP packet construction and MII TX interface
Integration: Custom SystemVerilog wrapper (eth_udp_send_wrapper.sv) created to flatten the SystemVerilog interfaces so the module can be instantiated from VHDL

Project 13 Original Work:

bbo_udp_formatter.vhd - BBO to UDP payload formatter with pipelined state machine
eth_udp_send_wrapper.sv - SystemVerilog/VHDL language interoperability bridge
XDC timing constraints for generated clock domains (clk_25mhz)
Integration into mii_eth_top.vhd
UDP packet format specification and parsing examples (Python/C++)
Complete documentation and architecture updates

Attribution: The fpga-ethernet-udp project by Adam Christiansen provides the excellent eth_udp_send SystemVerilog implementation that handles low-level UDP/IP packet construction and MII TX interface timing. Project 13 integrates this module through a custom wrapper pattern and implements the application-specific BBO formatting and clock domain management.

This project demonstrates production-grade FPGA design for trading systems, including BRAM architecture, FSM design, mixed-language integration, and comprehensive debugging techniques.
