FIXEdge Cluster Solution Operation and Maintenance Guide
FIXEdge node management
Add node
To add a node to the cluster, add its configuration to the configuration file and run the playbook (see the deployment manual).
Be careful with the database part of the playbook. During initial deployment, the playbook scripts create a clean database structure. If you create a playbook for adding a node, you must exclude the database part; otherwise all data will be lost. It is strongly recommended to back up the database before performing any actions with playbooks.
Remove node
To remove a FIXEdge node, do the following:
- Stop the FIXEdge node ($sudo systemctl stop fixedge).
- Remove the health check from Consul ($sudo rm /srv/consul-agent/etc/fixedge-respond-check.json).
- Stop the consul-agent service if you want to remove the node from the list of nodes in the Consul GUI ($sudo systemctl stop consul-agent).
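The removal steps above can be sketched as a small shell function. This is only an illustration: the `run` helper, the `DRY_RUN` switch and the function name `remove_fixedge_node` are hypothetical and not part of the FIXEdge distribution; the commands and the health check path are the ones listed above. By default nothing is executed and the commands are only printed.

```shell
# Hypothetical helper: print the command in dry-run mode, execute it otherwise.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around the three removal steps.
# $1 = "yes" to also stop consul-agent (removes the node from the Consul GUI list).
remove_fixedge_node() {
    run sudo systemctl stop fixedge
    run sudo rm /srv/consul-agent/etc/fixedge-respond-check.json
    if [ "${1:-no}" = "yes" ]; then
        run sudo systemctl stop consul-agent
    fi
}
```

For example, `remove_fixedge_node yes` prints all three commands, and `DRY_RUN=0` makes the function actually execute them on the node.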
Node maintenance
Configuration
Session and schedule properties are managed via the Configuration Service with the help of the Configuration Tool. All other configuration parameters are managed via local property files, in the same way as for a standalone FIXEdge.
Logs
FIXEdge node logs are collected in centralized storage, the Splunk system. See the document How to integrate FIXEdge with Splunk.
Create a data source input in Splunk on TCP port 1514 with sourcetype=log4j.
Simple search example:
Cluster services management
Running several instances of services
Additional service instances are deployed via the playbook, in the same way as FIXEdge nodes are added. Several instances of the Configuration Service can run at the same time. All instances are equal (there is no leader), so users can use any of them. Consul Coordinator allows several instances of a service to run.
Services are removed in the same way as FIXEdge nodes.
Configuration service
Configuration Files
Directory path for configuration files:
/srv/configuration-service
Logs
Directory path for logs:
/srv/configuration-service/logs
Logs can be viewed with the command:
$journalctl -u configuration-service
Consul agent
A Consul agent is deployed on each hardware unit to connect the cluster services to the Consul server.
Configuration files
/srv/consul/etc
Logs
Logs can be viewed with the command:
$journalctl -u consul-agent
Consul server
Configuration files
/srv/consul/etc
Logs
The log location depends on the customer configuration.
Load-Balancer (HA-Proxy)
Configuration files
/etc/haproxy
Logs
/var/log/messages
Session management
Users can add, modify and remove sessions with the Configuration Tool.
Schedule management
Users can add, modify and remove schedules with the Configuration Tool.
Troubleshooting
Check health state of services and nodes
The health of nodes and services can be seen in the Consul Web UI. Healthy services and nodes are displayed as shown below.
You can also check the state of a service on the hardware unit where the service or node is deployed, using the Linux CLI command:
$sudo systemctl status <service name>
For example:
$sudo systemctl status fixedge
$sudo systemctl status consul-agent
Example of healthy service:
[user@ecsc00a02a94 etc]$ sudo systemctl status consul-agent
● consul-agent.service - Consul Agent
   Loaded: loaded (/usr/lib/systemd/system/consul-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2019-03-14 11:08:45 UTC; 21h ago
 Main PID: 18977 (consul)
    Tasks: 9
   Memory: 24.6M
   CGroup: /system.slice/consul-agent.service
Example of an unhealthy node:
[user@ecsc00a03c92 ~]$ sudo systemctl status fixedge
● fixedge.service - FIXEdge Server
   Loaded: loaded (/usr/lib/systemd/system/fixedge.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Fri 2019-03-15 08:56:32 UTC; 2s ago
  Process: 12401 ExecStop=/srv/fixedge/bin/FixEdge1.stop.sh (code=exited, status=0/SUCCESS)
  Process: 10168 ExecStart=/srv/fixedge/bin/FixEdge1.run.sh (code=killed, signal=TERM)
 Main PID: 10168 (code=killed, signal=TERM)
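When several services run on the same hardware unit, the per-service status checks can be combined into one pass. The sketch below relies on `systemctl is-active`, which prints the unit state and returns a non-zero code unless the unit is active. The function name `check_services` and the `SYSTEMCTL` override variable are hypothetical; the unit names in the usage example are the ones used elsewhere in this guide.

```shell
# Command used to query unit state; overridable (hypothetical variable) for testing.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Print "<unit>: <state>" for every unit given as an argument.
# Returns 1 if at least one unit is not active, 0 otherwise.
check_services() {
    rc=0
    for svc in "$@"; do
        # is-active exits non-zero for inactive, failed or unknown units
        state="$("$SYSTEMCTL" is-active "$svc" 2>/dev/null)" || rc=1
        echo "$svc: ${state:-unknown}"
    done
    return "$rc"
}
```

A possible call on a host that runs a FIXEdge node is `check_services fixedge consul-agent`; on a host that runs the Configuration Service, `check_services configuration-service consul-agent`.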
Adding node unsuccessful - host unreachable
Problem: when adding a new node by running the playbook, the user gets an error message similar to:
PLAY [Deploy the FIXEdge daemon] **********************************************************************************************************************************************************************************
TASK [Gathering Facts] ********************************************************************************************************************************************************************************************
task path: /home/egor/work/fixedge-cluster-deployment-playbook/addfe.yml:22
fatal: [10.6.217.42]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 10.6.217.42 port 22: Connection timed out\r\n", "unreachable": true}
to retry, use: --limit @/home/egor/work/fixedge-cluster-deployment-playbook/addfe.retry
PLAY RECAP ********************************************************************************************************************************************************************************************************
10.6.217.42 : ok=0 changed=0 unreachable=1 failed=0
consul_server_1 : ok=2 changed=0 unreachable=0 failed=0
cs_1 : ok=2 changed=0 unreachable=0 failed=0
fixedge_1 : ok=2 changed=0 unreachable=0 failed=0
fixedge_2 : ok=2 changed=0 unreachable=0 failed=0
haproxy : ok=2 changed=0 unreachable=0 failed=0
localhost : ok=2 changed=1 unreachable=0 failed=0
Description: the node was not added.
Solution: a possible reason is that the node IP address is unreachable or a wrong IP address is specified in the configuration file. Check network connectivity and the configuration file.
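Connectivity to the SSH port of the new node can be verified from the host that runs the playbook before the playbook is started again. The following is a minimal sketch, assuming that `bash` and the coreutils `timeout` command are available on that host; the function names `port_reachable` and `check_ssh` are hypothetical. It uses the `/dev/tcp` redirection built into bash, so no extra tools are required.

```shell
# Succeeds if a TCP connection to host $1, port $2 can be opened
# within $3 seconds (default 5). Uses bash's built-in /dev/tcp.
port_reachable() {
    timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Report SSH (port 22) reachability for every host given as an argument.
check_ssh() {
    for host in "$@"; do
        if port_reachable "$host" 22; then
            echo "$host: ssh port reachable"
        else
            echo "$host: ssh port NOT reachable"
        fi
    done
}
```

For the failure shown above, the call would be `check_ssh 10.6.217.42`. If the port is reported as not reachable, check the IP address in the configuration file first and then the network path to the node.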
Cannot establish session - problems with database
Problem: when trying to establish a session, the user gets an error message similar to:
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : session could join infinite messages in bunch.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : FIX44 session was created.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : incoming sequence number was forcedly set to 1.
[3/12/2019 6:22:33 PM] Cannot find the property "StorageCreationTime" in the storage "D:\Distr\test\FIX_Antenna_NET40_2.26.0_288\samples\SimpleClient\x64-Release\.\logs\FIXCLIENT-FIXEDGE_1903121422333791", used current date.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : Message storage was reset Locally.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : outgoing sequence number was forcedly set to 1.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : Connection parameters were switched to primary set.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : Connecting to 10.6.223.32:8901
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : Change state: old state=Initial new state=WaitForConfirmLogon
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 6:22:33 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE'.
[3/12/2019 6:22:33 PM] Session <FIXCLIENT, FIXEDGE> : Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Session 'FIXCLIENTFIXEDGE' tries to reconnect.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Asynchronously connecting to 10.6.223.32:8901
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Asynchronous connect completed, error code: 0
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : the telecommunication link was restored.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Change state: old state=Reconnect new state=WaitForConfirmLogon
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 6:22:34 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE'.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Session 'FIXCLIENTFIXEDGE' tries to reconnect.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Asynchronously connecting to 10.6.223.32:8901
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Asynchronous connect completed, error code: 0
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : the telecommunication link was restored.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Change state: old state=Reconnect new state=WaitForConfirmLogon
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 6:22:34 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE'.
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 6:22:34 PM] Forced session reconnect for the session 'FIXCLIENTFIXEDGE' was not started because its reconnect tries has expired.
[3/12/2019 6:22:34 PM] Received Logout message:
[3/12/2019 6:22:34 PM] Session <FIXCLIENT, FIXEDGE> : Change state: old state=Reconnect new state=NonGracefullyTerminated
Description: the session is not established (Non-Gracefully Terminated).
Solution: a possible reason is a problem with storing FIX messages. Check network connectivity to the corresponding database and check the database health.
Cannot establish session - load-balancer is unreachable or down
Problem: when trying to establish a session, the user gets an error message similar to:
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : Incoming sequence number was restored: 2.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : Outgoing sequence number was restored: 3.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : data about the previous run were found in the message storage.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : session could join infinite messages in bunch.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : FIX44 session was created.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : incoming sequence number was forcedly set to 1.
[3/12/2019 3:59:05 PM] Cannot find the property "StorageCreationTime" in the storage "D:\Distr\test\FIX_Antenna_NET40_2.26.0_288\samples\SimpleClient\x64-Release\.\logs\FIXCLIENT-FIXEDGE1_1903121159059131", used current date.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : Message storage was reset Locally.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : outgoing sequence number was forcedly set to 1.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : Connection parameters were switched to primary set.
[3/12/2019 3:59:05 PM] Session <FIXCLIENT, FIXEDGE1> : Connecting to 10.6.223.32:8901
[3/12/2019 3:59:07 PM] Session <FIXCLIENT, FIXEDGE1> : Unable to establish connection: connect() to (10.6.223.32:8901) failed. No connection could be made because the target machine actively refused it. (Error code = 10061)
[3/12/2019 3:59:07 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=Initial new state=NonGracefullyTerminated
[3/12/2019 3:59:07 PM] connect() to (10.6.223.32:8901) failed. No connection could be made because the target machine actively refused it. (Error code = 10061)
Description: the session is not established (Non-Gracefully Terminated).
Solution: check the Load-Balancer (HA-Proxy) health; if the service is down, start it. Check network connectivity from the client workstation to the load balancer.
Cannot establish session - service discovery does not work
Problem: when trying to establish a session, the user gets an error message similar to:
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : Incoming sequence number was restored: 72.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : Outgoing sequence number was restored: 73.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : data about the previous run were found in the message storage.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : session could join infinite messages in bunch.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : FIX44 session was created.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : incoming sequence number was forcedly set to 1.
[3/12/2019 3:34:24 PM] Cannot find the property "StorageCreationTime" in the storage "D:\Distr\test\FIX_Antenna_NET40_2.26.0_288\samples\SimpleClient\x64-Release\.\logs\FIXCLIENT-FIXEDGE1_1903121134245091", used current date.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : Message storage was reset Locally.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : outgoing sequence number was forcedly set to 1.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : Connection parameters were switched to primary set.
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : Connecting to 10.6.223.32:8901
[3/12/2019 3:34:24 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=Initial new state=WaitForConfirmLogon
[3/12/2019 3:34:27 PM] Session <FIXCLIENT, FIXEDGE1> : The session is non-gracefully terminated because the Logon acceptor did not respond in the given time frame (3 sec).
[3/12/2019 3:34:27 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=WaitForConfirmLogon new state=NonGracefullyTerminated
[3/12/2019 3:34:27 PM] Session <FIXCLIENT, FIXEDGE1> : active session was closed non-gracefully (The session is non-gracefully terminated because the Logon acceptor did not respond in the given time frame (3 sec).).
[3/12/2019 3:34:27 PM] Received Logout message:
Description: the session is not established (Non-Gracefully Terminated).
Solution: check the Service Discovery (Consul) health; if the service is down, start it.
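Besides the Consul Web UI, the Consul state can be probed from the command line through the Consul HTTP API. The sketch below queries the `/v1/status/leader` endpoint, which returns the address of the current leader as a JSON string, or an empty string when the cluster has no leader. It assumes that `curl` is installed and that the API listens on the default address `http://127.0.0.1:8500`; the actual address depends on the deployment. The names `CONSUL_ADDR`, `has_leader` and `consul_leader_check` are hypothetical.

```shell
# Address of the Consul HTTP API (assumption: default local agent address).
CONSUL_ADDR="${CONSUL_ADDR:-http://127.0.0.1:8500}"

# Succeeds when the /v1/status/leader response body ($1) names a leader.
has_leader() {
    case "$1" in
        ''|'""') return 1 ;;   # empty body or empty JSON string: no leader
        *) return 0 ;;
    esac
}

# Returns 0 if a leader is reported, 1 if there is no leader,
# 2 if the API itself cannot be reached.
consul_leader_check() {
    resp="$(curl -s --max-time 5 "$CONSUL_ADDR/v1/status/leader")" || {
        echo "Consul API not reachable at $CONSUL_ADDR"
        return 2
    }
    if has_leader "$resp"; then
        echo "Consul leader: $resp"
    else
        echo "Consul has no leader"
        return 1
    fi
}
```

If `consul_leader_check` reports that the API is not reachable, check the consul-agent service on the host first; if it reports no leader, check the Consul server.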
Cannot establish session - no active nodes
Problem: when trying to establish a session, the user gets an error message similar to:
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Incoming sequence number was restored: 1.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Outgoing sequence number was restored: 1.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : data about the previous run were found in the message storage.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : session could join infinite messages in bunch.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : FIX44 session was created.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : incoming sequence number was forcedly set to 1.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Connection parameters were switched to primary set.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Connecting to 10.6.223.32:8901
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=Initial new state=WaitForConfirmLogon
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 3:46:47 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE1'.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Session 'FIXCLIENTFIXEDGE1' tries to reconnect.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronously connecting to 10.6.223.32:8901
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronous connect completed, error code: 0
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link was restored.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=Reconnect new state=WaitForConfirmLogon
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 3:46:47 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE1'.
[3/12/2019 3:46:47 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : Session 'FIXCLIENTFIXEDGE1' tries to reconnect.
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronously connecting to 10.6.223.32:8901
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronous connect completed, error code: 0
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link was restored.
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=Reconnect new state=WaitForConfirmLogon
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 3:46:48 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE1'.
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 3:46:48 PM] Forced session reconnect for the session 'FIXCLIENTFIXEDGE1' was not started because its reconnect tries has expired.
[3/12/2019 3:46:48 PM] Received Logout message:
[3/12/2019 3:46:48 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=Reconnect new state=NonGracefullyTerminated
Description: the session is not established (Non-Gracefully Terminated).
Solution: check the health of the FIXEdge nodes; a possible reason is that all nodes are down. If so, start at least one node and try to establish the session again.
Cannot establish session - invalid login
Problem: when trying to establish a session, the user gets an error message similar to:
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : Session 'FIXCLIENTFIXEDGE1' tries to reconnect.
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronously connecting to 10.6.223.32:8901
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : Asynchronous connect completed, error code: 0
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link was restored.
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=Reconnect new state=WaitForConfirmLogon
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : the telecommunication link error was detected (Connection::receive(), EOF from 10.6.223.32:8901).
[3/12/2019 3:19:26 PM] Start forced session reconnect for the session 'FIXCLIENTFIXEDGE1'.
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=WaitForConfirmLogon new state=Reconnect
[3/12/2019 3:19:26 PM] Forced session reconnect for the session 'FIXCLIENTFIXEDGE1' was not started because its reconnect tries has expired.
[3/12/2019 3:19:26 PM] Received Logout message:
[3/12/2019 3:19:26 PM] Session <FIXCLIENT, FIXEDGE1> : Change state: old state=Reconnect new state=NonGracefullyTerminated
Description: the session is not established (Non-Gracefully Terminated).
Solution: check the login and password parameters of the session.