Node Operations Best Practices

Whether you operate nodes in the public cloud or on-premises, the following best practices help ensure the availability and reliability of your Stream Nodes.

Maintenance

1

Regular Updates

  • Keep your node software up to date with the latest security patches and feature updates. Regularly check for new releases.
  • Using the deployment scripts, you can bring up a new FE instance in validation mode. Once you are confident it is passing health checks, switch it to be your new Running node and de-provision the prior FE.
  • Storage schema updates are applied as database migrations using golang-migrate and run automatically within releases as part of Node FE upgrades.
  • You are encouraged to set STANDBYONSTART to true in your environment variables if you are running blue-green deployments behind a proxy or network load balancer. This allows you to bring up a new FE instance in standby mode; once it is passing health checks, it automatically switches to primary.

If your node operator or your node has not yet been fully registered, you can test running an unattached Node FE to ensure health checks pass by setting STANDBYONSTART=true. A node started in this manner will not attach to the Storage layer, and it will not shut down if it is not yet registered on the River Chain.
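
The health-check gate described above can be sketched as a small polling loop. This is a minimal illustration, not part of the deployment scripts: the function names, the retry parameters, and the example URL are all assumptions, and you would substitute the health endpoint your own FE exposes.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and non-2xx responses count as unhealthy.
        return False


def wait_until_healthy(
    check: Callable[[], bool],
    required_successes: int = 3,
    max_attempts: int = 30,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `check` until it passes `required_successes` times in a row.

    Returns False if that never happens within `max_attempts` polls.
    """
    streak = 0
    for attempt in range(max_attempts):
        if check():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0  # a single failure resets the streak
        if attempt < max_attempts - 1:
            sleep(interval_s)
    return False


# Example usage (hypothetical URL; replace with your new FE's health endpoint):
#   ok = wait_until_healthy(
#       lambda: http_status_ok("https://new-node.example.com/status"))
#   If ok is True, switch traffic to the new FE and de-provision the old one.
```

Requiring several consecutive passes, rather than a single one, guards against switching traffic to an instance that is only intermittently healthy.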
2

Backups

  • Implement routine backups of your Node Storage to prevent data loss.
  • Establish an operating procedure for automated or manual data restoration from backups in the event of an outage.
  • Use monitoring tools to track your node’s performance and health.
  • If your node storage crashes, restore it from a backup. Once stream replication is available, the node will also be able to recover and catch up from the peers responsible for the same chunks of data.
As of October 2024, stream replication has not yet been implemented. Until it is, data cannot be recovered from peers. It is therefore important that node operators back up their node storage regularly and ensure high availability through their cloud provider's database setup.
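
A routine backup procedure usually includes a naming scheme and a retention policy. The sketch below shows one possible approach; the function names, the `.dump` suffix, and the default retention values are illustrative assumptions, not settings taken from the node software. It only decides which backups to prune and leaves the actual dump and restore steps to your database tooling.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def backup_name(node_id: str, taken_at: datetime) -> str:
    """Build a sortable backup name from a timezone-aware timestamp (UTC)."""
    return f"{node_id}-{taken_at.astimezone(timezone.utc):%Y%m%dT%H%M%SZ}.dump"


def backups_to_delete(
    backups: Dict[str, datetime],
    now: datetime,
    keep_days: int = 14,  # assumed retention window
    keep_min: int = 3,    # always keep this many of the newest backups
) -> List[str]:
    """Return names of backups older than `keep_days`.

    The `keep_min` newest backups are never returned, so a long gap in the
    backup schedule cannot cause every backup to be pruned.
    """
    newest_first = sorted(backups, key=lambda name: backups[name], reverse=True)
    protected = set(newest_first[:keep_min])
    cutoff = now - timedelta(days=keep_days)
    return sorted(
        name
        for name in newest_first
        if name not in protected and backups[name] < cutoff
    )
```

Protecting a minimum number of recent backups regardless of age matters most while peer-based recovery is unavailable, because the backups are then the only recovery path.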
3

Monitoring

  • Regularly review your node’s performance. Use the logs, metrics, and profiling tools exposed by the node to tune your observability stack.
  • Adjust resource allocation and network settings as needed to optimize for throughput and reliability.
By setting METRICS__ENABLED=true in your node’s environment, you can enable detailed metrics collection for your node. Metrics are instrumented using OpenTelemetry and can be used to monitor your node’s performance and health by navigating to the metrics endpoint at https://<node-hostname>/metrics. See node observability for more information.
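
As a rough sketch of consuming the metrics endpoint from a script, the example below fetches the page and parses it into a dictionary. It assumes the endpoint serves the Prometheus text exposition format, which is common for /metrics endpoints but is not confirmed here. The function names are hypothetical, and the parser is deliberately simplified (it ignores timestamps and does not handle every escaping rule).

```python
import urllib.request
from typing import Dict


def fetch_metrics_text(url: str, timeout: float = 5.0) -> str:
    """Download the raw metrics page, e.g. https://<node-hostname>/metrics."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus-style text lines into {series: value}.

    Comment lines (# HELP, # TYPE) and blank lines are skipped. The series
    key keeps its label set, e.g. 'name{label="x"}'.
    """
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Label values may contain spaces, so split after the closing brace.
            end = line.rfind("}")
            series, rest = line[: end + 1], line[end + 1 :].split()
        else:
            parts = line.split()
            series, rest = parts[0], parts[1:]
        if not rest:
            continue  # malformed line without a value
        try:
            samples[series] = float(rest[0])
        except ValueError:
            continue  # skip values that are not numeric
    return samples
```

In practice a Prometheus-compatible scraper would normally collect these metrics; a parser like this is mainly useful for quick checks or ad hoc alerting scripts.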

Troubleshooting

Using logs, metrics and profiling tools, you can identify and resolve issues with your node. Some issues may have specific resolutions on the River Issue Tracker.

If you find a new bug or security vulnerability, please file an issue on the River Issue Tracker for the core development team to address.

1

Common Issues

  • Address typical problems such as connectivity issues, slow transaction processing, or database errors with targeted troubleshooting steps provided in the network’s documentation.
2

Diagnostic Tools

  • Utilize built-in diagnostic tools and logs to identify issues. Follow systematic troubleshooting procedures to resolve operational problems.

Security

1

Access Control

  • Implement strict access controls for administrative operations. Use secure authentication methods to protect against unauthorized access.
2

Encryption and Network Security

  • Secure data in transit and at rest using encryption. Apply network security best practices, such as firewalls and secure protocols, to protect against external attacks.
3

Regular Security Audits

  • Conduct regular security audits to identify and mitigate potential vulnerabilities. Stay informed about the latest security threats and apply recommended countermeasures.