FIXEdge Cluster Solution Operation and Maintenance Guide

FIXEdge node management

Add node

To add a node to the cluster, add its configuration details to the configuration file and run the playbook (see the deployment manual).

Take care with the database part of the playbook. During initial deployment, the playbook scripts create a clean database structure. If you create a playbook for adding a node, you must exclude the database part; otherwise all data will be lost. It is strongly recommended to back up the database before performing any action with playbooks.
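As a precaution before running a playbook that adds a node, the task list can be reviewed without executing anything. The sketch below uses the `--list-tasks` option of `ansible-playbook` and flags any task whose name mentions a database. It assumes that database-related tasks contain the word "database" in their names, which may not hold for your playbook; the inventory file name in the example is hypothetical.

```shell
#!/bin/sh
# Sketch: list the tasks a playbook would run and flag database-related ones.
# Assumption: database tasks have "database" somewhere in their task name.

# Command used to invoke Ansible; overridable for testing.
ANSIBLE_PLAYBOOK=${ANSIBLE_PLAYBOOK:-ansible-playbook}

# $1 = playbook file, $2 = inventory file.
# Prints the matching task lines; exit status 0 means at least one match.
playbook_db_tasks() {
    "$ANSIBLE_PLAYBOOK" -i "$2" "$1" --list-tasks | grep -i 'database'
}

# Example (hypothetical inventory name):
#   if playbook_db_tasks addfe.yml inventory.ini; then
#       echo "Playbook contains database tasks; do not run it on a live cluster." >&2
#   fi
```

An empty result is only a hint, not a guarantee: review the full `--list-tasks` output by eye as well.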

Remove node

To remove a FIXEdge node, do the following:

  1. Stop the FIXEdge node ($ sudo systemctl stop fixedge).
  2. Remove the health check from Consul ($ sudo rm /srv/consul-agent/etc/fixedge-respond-check.json).
  3. If you want to remove the node from the list of nodes in the Consul GUI, stop the consul-agent service ($ sudo systemctl stop consul-agent).
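The removal steps can be collected into one small script. This is a minimal sketch rather than a supported tool: the function names and the DRY_RUN switch are introduced here for illustration, while the commands and paths are the ones from the steps above.

```shell
#!/bin/sh
# Sketch: decommission a FIXEdge node using the removal steps.
# Set DRY_RUN=1 to print the commands instead of executing them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = "yes" to also stop consul-agent, which removes the node
#      from the list of nodes in the Consul GUI (default: no).
remove_fixedge_node() {
    run sudo systemctl stop fixedge &&
    run sudo rm /srv/consul-agent/etc/fixedge-respond-check.json &&
    if [ "${1:-no}" = "yes" ]; then
        run sudo systemctl stop consul-agent
    fi
}

# Example:
#   DRY_RUN=1; remove_fixedge_node yes   # preview the commands
#   DRY_RUN=0; remove_fixedge_node yes   # execute them
```

Each step runs only if the previous one succeeded, so a failure to stop FIXEdge leaves the health check in place.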

Node maintenance

Configuration

Session and schedule properties are managed through the Configuration Service using the Configuration Tool. All other configuration parameters are managed through local property files, as in standalone FIXEdge (see)

Logs

FIXEdge node logs are collected in centralized storage, the Splunk system. See the document How to integrate FIXEdge with Splunk.

Create a data source input in Splunk on TCP port 1514 with sourcetype=log4j.

Simple search example:

Cluster services management

Running several instances of services

Additional service instances are deployed via the playbook, in the same way as FIXEdge nodes are added. You can run several instances of the Configuration Service. All instances are equal (there is no leader), and users can use any of them. Consul Coordinator allows several instances of a service to run.

Services can be removed in the same way as FIXEdge nodes.

Configuration service

Configuration Files

Directory path for configuration files:

/srv/configuration-service

Logs

Directory path for logs:

/srv/configuration-service/logs

Logs can be viewed with the command:

$ journalctl -u configuration-service
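When the full journal is too long to read, standard journalctl options can narrow it down. The sketch below wraps one such query in a helper; the function name and the default one-hour window are illustrative choices, not part of the product.

```shell
#!/bin/sh
# Sketch: show recent error-level journal entries for a systemd unit.
# $1 = unit name
# $2 = optional start time accepted by journalctl --since (default: 1 hour ago)
recent_errors() {
    journalctl -u "$1" -p err --since "${2:-1 hour ago}" --no-pager
}

# Examples:
#   recent_errors configuration-service
#   recent_errors configuration-service "2019-03-14 11:00"
#   journalctl -u configuration-service -f    # follow new entries as they arrive
```

The same helper works for any other unit in the cluster, for example consul-agent.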

Consul agent

A Consul agent is deployed on each hardware unit to connect cluster services to the Consul server.

Configuration files

/srv/consul/etc

Logs

Logs can be viewed with the command:

$ journalctl -u consul-agent

Consul server

Configuration files

/srv/consul/etc

Logs

The log location depends on the customer configuration.

Load-Balancer (HA-Proxy)

Configuration files

/etc/haproxy

Logs

/var/log/messages


Session management

Users can add, modify, and remove sessions with the Configuration Tool.

Schedule management

Users can add, modify, and remove schedules with the Configuration Tool.

Troubleshooting 

Check health state of services and nodes

Health of nodes and services can be seen in Consul WebUI. Healthy services and nodes are displayed as below.



You can also check the state of a service on the hardware unit where the service or node is deployed, using Linux CLI commands:

$ sudo systemctl status <service name>

For example:

$ sudo systemctl status fixedge

$ sudo systemctl status consul-agent

Example of healthy service:

[user@ecsc00a02a94 etc]$ sudo systemctl status consul-agent
● consul-agent.service - Consul Agent
   Loaded: loaded (/usr/lib/systemd/system/consul-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2019-03-14 11:08:45 UTC; 21h ago
 Main PID: 18977 (consul)
    Tasks: 9
   Memory: 24.6M
   CGroup: /system.slice/consul-agent.service


Example of an unhealthy node:

[user@ecsc00a03c92 ~]$ sudo systemctl status fixedge
● fixedge.service - FIXEdge Server
   Loaded: loaded (/usr/lib/systemd/system/fixedge.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Fri 2019-03-15 08:56:32 UTC; 2s ago
  Process: 12401 ExecStop=/srv/fixedge/bin/FixEdge1.stop.sh (code=exited, status=0/SUCCESS)
  Process: 10168 ExecStart=/srv/fixedge/bin/FixEdge1.run.sh (code=killed, signal=TERM)
 Main PID: 10168 (code=killed, signal=TERM)
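To check several services on a host in one pass, the `systemctl is-active` subcommand is convenient: it prints a one-word state and exits with a non-zero status when the unit is not active. The helper below is a minimal sketch; the function name is illustrative.

```shell
#!/bin/sh
# Sketch: report the state of several systemd units at once.
# Prints "<unit>: <state>" for each unit and returns non-zero
# if at least one unit is not active.
check_units() {
    rc=0
    for unit in "$@"; do
        # is-active prints the state and exits non-zero unless it is "active"
        state=$(systemctl is-active "$unit") || rc=1
        echo "$unit: $state"
    done
    return "$rc"
}

# Example:
#   check_units fixedge consul-agent
```

For a unit that reports anything other than active, `sudo systemctl status <service name>` gives the details shown in the examples above.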



Adding node unsuccessful - host unreachable

Problem: when the user tries to add a new node by running the playbook, an error message similar to the following appears:

PLAY [Deploy the FIXEdge daemon] **********************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************
task path: /home/egor/work/fixedge-cluster-deployment-playbook/addfe.yml:22
fatal: [10.6.217.42]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 10.6.217.42 port 22: Connection timed out\r\n", "unreachable": true}
 to retry, use: --limit @/home/egor/work/fixedge-cluster-deployment-playbook/addfe.retry
PLAY RECAP ********************************************************************************************************************************************************************************************************
10.6.217.42                : ok=0    changed=0    unreachable=1    failed=0  
consul_server_1            : ok=2    changed=0    unreachable=0    failed=0  
cs_1                       : ok=2    changed=0    unreachable=0    failed=0  
fixedge_1                  : ok=2    changed=0    unreachable=0    failed=0  
fixedge_2                  : ok=2    changed=0    unreachable=0    failed=0  
haproxy                    : ok=2    changed=0    unreachable=0    failed=0  
localhost                  : ok=2    changed=1    unreachable=0    failed=0

Description: the node was not added.

Solution: a possible reason is that the node's IP address is unreachable or that the wrong IP address is specified in the configuration file. Check network connectivity and the configuration file.
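Network connectivity to the SSH port can be verified from the machine that runs the playbook before retrying. The sketch below assumes a netcat (`nc`) build that supports the `-z` (scan only) and `-w` (timeout) options, which common implementations do; the function names and the 5-second timeout are illustrative.

```shell
#!/bin/sh
# Sketch: check that the SSH port of each target host accepts TCP connections.
# Assumes an nc implementation that supports -z and -w.

# $1 = host, $2 = port (default 22)
ssh_port_open() {
    nc -z -w 5 "$1" "${2:-22}"
}

# Prints one line per host; returns non-zero if any host is unreachable.
check_hosts() {
    rc=0
    for host in "$@"; do
        if ssh_port_open "$host"; then
            echo "$host: reachable"
        else
            echo "$host: UNREACHABLE"
            rc=1
        fi
    done
    return "$rc"
}

# Example:
#   check_hosts 10.6.217.42
```

A host that is reported as unreachable here will also fail in the Gathering Facts task, so fix the address or the network path first.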

Cannot establish session - problems with database

Problem: when the user tries to establish a session, an error message similar to the following appears:

[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : session could join infinite messages in bunch.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : FIX44 session was created.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : incoming sequence number was forcedly set to 1.
[3/12/2019 6:22:33 PM] Cannot find the property "StorageCreationTime" in the storage "D:\Distr\test\FIX_Antenna_NET40_2.26.0_288\samples\SimpleClient\x64-Release\.\logs\FIXCLIENT-FIXEDGE_1903121422333791", used current date.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : Message storage was reset Locally.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : outgoing sequence number was forcedly set to 1.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : Connection parameters were switched to primary set.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : Connecting to 10.6.223.32:8901
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> :  Change state: old state=Initial new state=WaitForConfirmLogon
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 6:22:33 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE'.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> :  Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Session 'FIXCLIENTFIXEDGE' tries to reconnect.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Asynchronously connecting to 10.6.223.32:8901
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Asynchronous connect completed, error code: 0
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : the telecommunication link was restored.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> :  Change state: old state=Reconnect new state=WaitForConfirmLogon
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 6:22:34 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE'.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> :  Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Session 'FIXCLIENTFIXEDGE' tries to reconnect.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Asynchronously connecting to 10.6.223.32:8901
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Asynchronous connect completed, error code: 0
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : the telecommunication link was restored.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> :  Change state: old state=Reconnect new state=WaitForConfirmLogon
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 6:22:34 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE'.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> :  Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 6:22:34 PM] Forced session reconnect for the session 'FIXCLIENTFIXEDGE' was not started because its reconnect tries has expired.
[3/12/2019 6:22:34 PM] Received Logout message:
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> :  Change state: old state=Reconnect new state=NonGracefullyTerminated

Description: the session is not established (Non-Gracefully Terminated).

Solution: a possible reason is a problem with storing FIX messages. Check network connectivity to the corresponding database and check the database health.


Cannot establish session - load balancer is unreachable or down

Problem: when the user tries to establish a session, an error message similar to the following appears:

[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : Incoming sequence number was restored: 2.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : Outgoing sequence number was restored: 3.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : data about the previous run were found in the message storage.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : session could join infinite messages in bunch.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : FIX44 session was created.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : incoming sequence number was forcedly set to 1.
[3/12/2019 3:59:05 PM] Cannot find the property "StorageCreationTime" in the storage "D:\Distr\test\FIX_Antenna_NET40_2.26.0_288\samples\SimpleClient\x64-Release\.\logs\FIXCLIENT-FIXEDGE1_1903121159059131", used current date.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : Message storage was reset Locally.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : outgoing sequence number was forcedly set to 1.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : Connection parameters were switched to primary set.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : Connecting to 10.6.223.32:8901
[3/12/2019 3:59:07 PM] Session <FIXCLIENT, FIXEDGE1> : Unable to establish connection: connect() to (10.6.223.32:8901) failed. No connection could be made because the target machine actively refused it. (Error code = 10061)
[3/12/2019 3:59:07 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=Initial new state=NonGracefullyTerminated
[3/12/2019 3:59:07 PM] connect() to (10.6.223.32:8901) failed. No connection could be made because the target machine actively refused it. (Error code = 10061)

Description: the session is not established (Non-Gracefully Terminated).

Solution: check the health of the load balancer (HA-Proxy). If the service is down, start it. Check network connectivity from the client workstation to the load balancer.
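The "check and start if down" step can be scripted on the load balancer host. This is a sketch under the assumption that the systemd unit is named haproxy, which is the usual package default but is not confirmed by this guide; the function name is illustrative.

```shell
#!/bin/sh
# Sketch: start a systemd unit only if it is not already active.
# $1 = unit name (assumed to be "haproxy" for the load balancer)
ensure_running() {
    if systemctl is-active --quiet "$1"; then
        echo "$1 is already running"
    else
        echo "$1 is not running, starting it"
        sudo systemctl start "$1"
    fi
}

# Example:
#   ensure_running haproxy
```

If the service fails to start, review the configuration files in /etc/haproxy and the log in /var/log/messages.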


Cannot establish session - service discovery does not work

Problem: when the user tries to establish a session, an error message similar to the following appears:

[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : Incoming sequence number was restored: 72.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : Outgoing sequence number was restored: 73.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : data about the previous run were found in the message storage.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : session could join infinite messages in bunch.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : FIX44 session was created.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : incoming sequence number was forcedly set to 1.
[3/12/2019 3:34:24 PM] Cannot find the property "StorageCreationTime" in the storage "D:\Distr\test\FIX_Antenna_NET40_2.26.0_288\samples\SimpleClient\x64-Release\.\logs\FIXCLIENT-FIXEDGE1_1903121134245091", used current date.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : Message storage was reset Locally.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : outgoing sequence number was forcedly set to 1.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : Connection parameters were switched to primary set.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : Connecting to 10.6.223.32:8901
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=Initial new state=WaitForConfirmLogon
[3/12/2019 3:34:27 PM] Session <FIXCLIENT, FIXEDGE1> : The session is non-gracefully terminated because the Logon acceptor did not respond in the given time frame (3 sec).
[3/12/2019 3:34:27 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=WaitForConfirmLogon new state=NonGracefullyTerminated
[3/12/2019 3:34:27 PM] Session <FIXCLIENT, FIXEDGE1> : active session was closed non-gracefully (The session is non-gracefully terminated because the Logon acceptor did not respond in the given time frame (3 sec).).
[3/12/2019 3:34:27 PM] Received Logout message:

Description: the session is not established (Non-Gracefully Terminated).

Solution: check the health of Service Discovery (Consul). If the service is down, start it.
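Besides the Consul web UI and `systemctl status`, Consul health can be probed over its HTTP API. The sketch below queries the `/v1/status/leader` endpoint, which returns the address of the current leader as a JSON string, or an empty string when the cluster has no leader. The address http://127.0.0.1:8500 is Consul's default HTTP endpoint and is an assumption here; adjust CONSUL_ADDR to your deployment.

```shell
#!/bin/sh
# Sketch: ask Consul which server is the current leader.
# CONSUL_ADDR is an assumption (Consul's default HTTP address).
CONSUL_ADDR=${CONSUL_ADDR:-http://127.0.0.1:8500}

# Prints the leader address and returns 0 if a leader is elected;
# returns non-zero if Consul is unreachable or has no leader.
consul_leader() {
    leader=$(curl -fsS "$CONSUL_ADDR/v1/status/leader") || return 1
    # Strip the JSON quotes; an empty result means no leader
    leader=$(printf '%s' "$leader" | tr -d '"')
    [ -n "$leader" ] && echo "$leader"
}

# Example:
#   if consul_leader; then echo "Consul is up"; else echo "Consul is down or has no leader"; fi
```

A failure here points at the Consul server or the network path to it; the consul-agent log on the local host usually shows which.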

Cannot establish session - no active nodes

Problem: when the user tries to establish a session, an error message similar to the following appears:

[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Incoming sequence number was restored: 1.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Outgoing sequence number was restored: 1.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : data about the previous run were found in the message storage.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : session could join infinite messages in bunch.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : FIX44 session was created.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : incoming sequence number was forcedly set to 1.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Connection parameters were switched to primary set.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Connecting to 10.6.223.32:8901
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=Initial new state=WaitForConfirmLogon
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 3:46:47 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE1'.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Session 'FIXCLIENTFIXEDGE1' tries to reconnect.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronously connecting to 10.6.223.32:8901
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronous connect completed, error code: 0
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link was restored.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=Reconnect new state=WaitForConfirmLogon
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 3:46:47 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE1'.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : Session 'FIXCLIENTFIXEDGE1' tries to reconnect.
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronously connecting to 10.6.223.32:8901
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronous connect completed, error code: 0
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link was restored.
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=Reconnect new state=WaitForConfirmLogon
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 3:46:48 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE1'.
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 3:46:48 PM] Forced session reconnect for the session 'FIXCLIENTFIXEDGE1' was not started because its reconnect tries has expired.
[3/12/2019 3:46:48 PM] Received Logout message:
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=Reconnect new state=NonGracefullyTerminated

Description: the session is not established (Non-Gracefully Terminated).

Solution: check the health of the FIXEdge nodes. A possible reason is that all nodes are down. If so, start at least one node and try to establish the session again.


Cannot establish session - invalid login

Problem: when the user tries to establish a session, an error message similar to the following appears:

[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : Session 'FIXCLIENTFIXEDGE1' tries to reconnect.
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronously connecting to 10.6.223.32:8901
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronous connect completed, error code: 0
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link was restored.
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=Reconnect new state=WaitForConfirmLogon
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 3:19:26 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE1'.
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 3:19:26 PM] Forced session reconnect for the session 'FIXCLIENTFIXEDGE1' was not started because its reconnect tries has expired.
[3/12/2019 3:19:26 PM] Received Logout message:
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> :  Change state: old state=Reconnect new state=NonGracefullyTerminated

Description: the session is not established (Non-Gracefully Terminated).

Solution: check the login and password parameters of the session.