Updating Coordinating Node Software

All Coordinating Nodes (CNs) in an environment need to run the same version of the DataONE software stack. Yet, especially in production, we cannot take all of the CNs down at the same time to perform an upgrade. Instead, we follow a procedure that takes advantage of the built-in CN redundancy and the DNS round robin to achieve zero downtime. Specifically, we divide the CNs into two groups and upgrade one group, then the other, while adjusting the DNS round robin to resolve only to the nodes not being upgraded.

The important thing to note is that the individual CNs communicate with each other through different channels, and during the upgrade procedure, we need to close the channels gracefully to maintain data consistency.

Before starting, see the Prerequisites document to make sure you have everything you need. If the upgrade must be completed within a fixed time window, confirm in advance that all necessary resources are on hand.

Current Production CNs

IP             FQDN                   Tag
128.111.54.80  cn-ucsb-1.dataone.org  CN1
160.36.13.150  cn-orc-1.dataone.org   CN2
64.106.40.6    cn-unm-1.dataone.org   CN3

1. Prepare for the Release

1.1 Confirm Prerequisites

Make sure you are set up properly as a System Administrator.

1.2 Determine Active and Passive Nodes

As of 11/05/2014, we have only a single Active Master CN with two Passive Master Nodes. (The passive CNs may or may not be in the DNS RR.)

Under the single-active node operation scheme, the UCSB node is the active master node, meaning it’s the only one running synchronization and replication services (d1-processing). The other nodes are the passive nodes, and will be the ones to upgrade first (UNM and ORC).

Confirm the active node by logging on to each machine and running the following:

root@cn-ucsb-1:/etc/dataone/process#  ps -p $(cat /var/run/d1-processing.pid)
PID TTY          TIME CMD
15411 ?        00:03:20 jsvc
root@cn-ucsb-1:/etc/dataone/process# echo $?
0

d1-processing is confirmed to be running if echo $? prints 0.
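The check above can be wrapped in a small helper so that it is run the same way on every CN. This is a minimal sketch: the function name d1_processing_running is an illustrative assumption, and it uses kill -0 rather than ps, so it should be run as root, as in the example above. Only the pid file path comes from this document.

```shell
# Return 0 if the process recorded in the pid file is alive, non-zero otherwise.
# The default path is the pid file used in the example above.
# d1_processing_running is a hypothetical helper name.
d1_processing_running() {
    pidfile=${1:-/var/run/d1-processing.pid}
    [ -r "$pidfile" ] || return 1                 # no pid file: not running
    # kill -0 sends no signal; it only tests that the process exists
    # (run as root, otherwise a process owned by another user looks dead).
    kill -0 "$(cat "$pidfile")" 2>/dev/null
}

# Usage:
#   if d1_processing_running; then
#       echo "d1-processing is running on $(hostname)"
#   else
#       echo "d1-processing is NOT running on $(hostname)"
#   fi
```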

If d1-processing is not running on any CN instance, check the processing capabilities of each CN (start with UCSB) to see which are configured to run synchronization, replication, and logAggregation. See section 4.1.1 for details.

1.3 Record Initial DNS Round Robin configuration

Make note of the starting configuration of the DNS round robin to know what the final configuration should be restored to at the end of the upgrade:

dig <round-robin address>

Each CN in the DNS RR will have an entry in the ";; ANSWER SECTION".

The example below shows one CN (128.111.54.80) behind the round robin address:

$ dig cn.dataone.org

; <<>> DiG 9.8.3-P1 <<>> cn.dataone.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 65388
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;cn.dataone.org.      IN  A

;; ANSWER SECTION:
cn.dataone.org.   60  IN  A 128.111.54.80

;; Query time: 461 msec
;; SERVER: 10.0.1.1#53(10.0.1.1)
;; WHEN: Fri Mar  6 17:33:20 2015
;; MSG SIZE  rcvd: 48
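To keep a record that can be compared against later, the addresses can be pulled out of the dig output and saved. The following is a sketch only: the function name rr_addresses and the file name rr-initial.txt are assumptions, not part of the DataONE tooling.

```shell
# Print the IPv4 addresses found in the ANSWER SECTION of dig output
# (read from stdin), one per line and sorted, so snapshots are comparable.
# rr_addresses is a hypothetical helper name.
rr_addresses() {
    awk '/^;; ANSWER SECTION:/  { in_answer = 1; next }
         /^$/                   { in_answer = 0 }
         in_answer && $4 == "A" { print $5 }' | sort
}

# Usage (rr-initial.txt is a hypothetical file name):
#   dig cn.dataone.org | rr_addresses > rr-initial.txt
```

Running dig with the +short option prints only the addresses and may be enough when the full answer is not needed.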

2. Release Announcement

A CN upgrade is a new release, and you will need to notify the appropriate communities, even though there is zero downtime.

  • Create Redmine (redmine.dataone.org) Wiki page for the release, containing Release Notes
  • Send a release announcement to the operations listserve at DataONE containing:
    • a link to the Wiki page with release notes
    • the expected time window needed
    • note that the environment will be in read-only mode
    • any special considerations, for example if it’s a testing environment and you are not doing zero downtime
  • Request DataONE Administrator for help with DNS RR
    • if all CNs are in the DNS RR, you will need 4 switches (beginning, middle, 2x near the end)
    • otherwise only 2 switches are needed (middle, end)

3. Remove Passive Nodes from DNS RR

Contact a DataONE Administrator to remove the passive CN instances from the DNS Round Robin. These will be the CNs you will be working on first. The DataONE Administrators will need to know the IP addresses of those machines.

(You will be able to complete step 4 in parallel with step 3.)

4. Go into Read-Only Mode

In read-only mode, all processing daemons are stopped and their property files are reconfigured to inactivate them.

4.1 Turn off d1-processing

Note this step is only for the active CNs you identified in step 1.2.

4.1.1 Inactivate d1-processing modules

Set all the processing components to inactive.

In /etc/dataone/process there are three property files:

  • logAggregation.properties
  • synchronization.properties
  • replication.properties

In logAggregation.properties set the LogAggregator.active to FALSE. In synchronization.properties set the Synchronization.active to FALSE. In replication.properties set the Replication.active to FALSE.

cd /etc/dataone/process
# to show the existing settings
grep 'active' *.properties
#
# run this command if logAggregation.properties has LogAggregator.active=TRUE
cat logAggregation.properties | tee logAggregation.properties.bak | sed 's/LogAggregator.active=TRUE/LogAggregator.active=FALSE/' > tmp.properties && mv tmp.properties logAggregation.properties
# run this command if synchronization.properties has Synchronization.active=TRUE
cat synchronization.properties | tee synchronization.properties.bak | sed 's/Synchronization.active=TRUE/Synchronization.active=FALSE/' > tmp.properties && mv tmp.properties synchronization.properties
# run this command if replication.properties has Replication.active=TRUE
cat replication.properties | tee replication.properties.bak | sed 's/Replication.active=TRUE/Replication.active=FALSE/' > tmp.properties && mv tmp.properties replication.properties
#
# to confirm new settings
grep 'active' *.properties

4.1.2 Check logs for evidence of inactivation

If d1-processing is not running, you can skip this step. Otherwise, you need to make sure the configurations you set in the previous step take effect before stopping d1-processing. This is done by checking each component’s log files for logging messages asserting that the component is turned off.

component        log filepath
synchronization  /var/log/dataone/synchronize/cn-synchronization.log
logAggregation   /var/log/dataone/logAggregate/cn-aggregation.log
replication      /var/log/dataone/replicate/cn-replication.log
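One way to scan the three logs in a single pass is sketched below. The exact wording of the inactivation message is not given in this document, so the search pattern is left as a parameter that you must supply; the function name last_match is likewise an assumption.

```shell
# For each log file, print the most recent line matching PATTERN
# (case-insensitive), or report that nothing matched.
# last_match is a hypothetical helper name; PATTERN must be chosen by the
# operator, since the log message text is not specified here.
last_match() {
    pattern=$1
    shift
    for log in "$@"; do
        line=$(grep -i -- "$pattern" "$log" 2>/dev/null | tail -n 1)
        if [ -n "$line" ]; then
            echo "$log: $line"
        else
            echo "$log: NO MATCH"
        fi
    done
}

# Usage (replace PATTERN with text from the actual inactivation message):
#   last_match 'PATTERN' \
#       /var/log/dataone/synchronize/cn-synchronization.log \
#       /var/log/dataone/logAggregate/cn-aggregation.log \
#       /var/log/dataone/replicate/cn-replication.log
```

Check the timestamp on each reported line to be sure it was logged after the property files were changed.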

4.1.3 Stop the processing daemon

Once all components have inactivated themselves, you can stop processing with:

$ sudo /etc/init.d/d1-processing stop

4.2 Turn off Index Processing

Stop Generator and Processor for ALL CNs:

$ sudo /etc/init.d/d1-index-task-processor stop
$ sudo /etc/init.d/d1-index-task-generator stop
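Because this step has to be repeated on every CN, a small loop can reduce the chance of missing one. This is a sketch under assumptions: it presumes you can reach each CN over ssh from a single workstation, and the names CN_HOSTS, REMOTE, and on_all_cns are illustrative. The hostnames are those from the Current Production CNs table.

```shell
# Hostnames taken from the "Current Production CNs" table.
CN_HOSTS="cn-ucsb-1.dataone.org cn-orc-1.dataone.org cn-unm-1.dataone.org"

# Run one command on every CN in turn, stopping at the first failure.
# REMOTE is the remote-shell command (assumption: ssh -t, so that sudo can
# prompt for a password). Set REMOTE=echo for a dry run that only prints
# what would be executed. on_all_cns is a hypothetical helper name.
on_all_cns() {
    for host in $CN_HOSTS; do
        echo "== $host"
        ${REMOTE:-ssh -t} "$host" "$@" || return 1
    done
}

# Usage:
#   REMOTE=echo on_all_cns sudo /etc/init.d/d1-index-task-processor stop   # dry run
#   on_all_cns sudo /etc/init.d/d1-index-task-processor stop
#   on_all_cns sudo /etc/init.d/d1-index-task-generator stop
```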

4.3 Disable Cluster Communications

On all CNs, run the script below. It turns off the correct ports and makes the other settings changes needed for the upgrade procedure:

$ sudo /usr/local/bin/togglePortsAndReplication.sh disable

4.4 Set / Confirm Read Only Mode

The /etc/dataone/node.properties file should have a property named ‘cn.storage.readOnly’. If not, add it. The property should be set to TRUE on all CN instances.
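The "set it, or add it if missing" logic can be sketched as a small function. This assumes the file uses plain key=value lines with no spaces around the equals sign, as the other property files in this procedure do; the function name set_property is an assumption.

```shell
# Set KEY=VALUE in a properties file, adding the line if KEY is absent.
# A copy of the original is kept as FILE.bak. Assumes key=value lines with
# no spaces around '=' and a VALUE free of '/' and '&' (true for TRUE/FALSE).
# set_property is a hypothetical helper name.
set_property() {
    file=$1 key=$2 value=$3
    cp -p "$file" "$file.bak" || return 1
    # Escape the dots in the key so they match literally in the regex.
    key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
    if grep -q "^${key_re}=" "$file.bak"; then
        # Rewrite in place from the backup; this keeps the file's owner and mode.
        sed "s/^${key_re}=.*/${key}=${value}/" "$file.bak" > "$file"
    else
        printf '%s=%s\n' "$key" "$value" >> "$file"
    fi
}

# Usage (as root):
#   set_property /etc/dataone/node.properties cn.storage.readOnly TRUE
#   grep 'cn.storage.readOnly' /etc/dataone/node.properties
```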

5. Upgrade Passive CN instances

See CN Instance Upgrade instructions, and note all options chosen so that the same ones are used in step 7.

5.1 Re-enable cluster communications of Passive CN Instances

Re-enabling cluster communications early, especially Hazelcast, can reduce the time needed to build and distribute the shared SystemMetadata map.

$ sudo /usr/local/bin/togglePortsAndReplication.sh enable

6. Switch DNS Round Robin

Switch the DNS Round Robin to point to one of the upgraded CNs from the first upgrade set.

Note

ORC is usually the one chosen, based on network access reliability.

7. Upgrade the Active CN instance(s)

See CN Instance Upgrade instructions, and be sure to use the same options as in step 5.

8. Put Original Active CN(s) into Service

8.1 Enable Cluster communications

$ sudo /usr/local/bin/togglePortsAndReplication.sh enable

8.2 Switch DNS Round Robin

Change the DNS RR so that only the active CN(s) are in the RR.

9. Leave Read-Only Mode

This is the reverse of the process for entering read-only mode.

9.1 Start indexing for all CNs

Start up Processor and Generator

$ sudo /etc/init.d/d1-index-task-processor start
$ sudo /etc/init.d/d1-index-task-generator start

9.2 Start d1-processing on Active Master Node(s)

9.2.1 Re-activate d1-processing modules

Set all the processing components to active.

In /etc/dataone/process there are three property files:

  • logAggregation.properties
  • synchronization.properties
  • replication.properties

In logAggregation.properties set the LogAggregator.active to TRUE. In synchronization.properties set the Synchronization.active to TRUE. In replication.properties set the Replication.active to TRUE.

cd /etc/dataone/process
# to show the existing settings
grep 'active' *.properties
#
# run this command if logAggregation.properties has LogAggregator.active=FALSE
cat logAggregation.properties | tee logAggregation.properties.bak | sed 's/LogAggregator.active=FALSE/LogAggregator.active=TRUE/' > tmp.properties && mv tmp.properties logAggregation.properties
# run this command if synchronization.properties has Synchronization.active=FALSE
cat synchronization.properties | tee synchronization.properties.bak | sed 's/Synchronization.active=FALSE/Synchronization.active=TRUE/' > tmp.properties && mv tmp.properties synchronization.properties
# run this command if replication.properties has Replication.active=FALSE
cat replication.properties | tee replication.properties.bak | sed 's/Replication.active=FALSE/Replication.active=TRUE/' > tmp.properties && mv tmp.properties replication.properties
#
# to confirm new settings
grep 'active' *.properties

cat each file to make sure the settings were set.
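As an alternative to reading each file by eye, the three settings can be checked in one call. This is a minimal sketch; the function name check_active is an assumption, while the file and property names are the ones listed above.

```shell
# Verify that each processing component's *.active property in DIR has the
# wanted value (TRUE or FALSE). Prints one line per file and returns
# non-zero if any file does not contain the expected setting.
# check_active is a hypothetical helper name.
check_active() {
    dir=$1 want=$2 rc=0
    for pair in logAggregation:LogAggregator \
                synchronization:Synchronization \
                replication:Replication; do
        file=$dir/${pair%%:*}.properties     # text before the colon
        key=${pair##*:}.active               # text after the colon
        # -x: whole-line match, -F: fixed string (dots are not wildcards)
        if grep -qxF -- "${key}=${want}" "$file" 2>/dev/null; then
            echo "ok    $file  ${key}=${want}"
        else
            echo "FAIL  $file  expected ${key}=${want}"
            rc=1
        fi
    done
    return $rc
}

# Usage:
#   check_active /etc/dataone/process TRUE    # when leaving read-only mode
#   check_active /etc/dataone/process FALSE   # when entering read-only mode
```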

9.2.2 Start up Processing

$ sudo /etc/init.d/d1-processing start

10. Restore DNS Round Robin

If needed, restore the DNS Round Robin to the original state.
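To confirm the restore, the current round-robin addresses can be compared with the list noted in step 1.3. The sketch below assumes that list was saved one address per line in a file; the file name rr-initial.txt and the function name rr_matches are illustrative assumptions.

```shell
# Compare the current round-robin addresses (one per line on stdin) with a
# saved list, ignoring order. Returns 0 when the two sets are identical.
# rr_matches is a hypothetical helper name.
rr_matches() {
    saved=$1
    current=$(sort)                      # sort reads the function's stdin
    [ "$current" = "$(sort "$saved")" ]
}

# Usage (rr-initial.txt is a hypothetical file holding the step 1.3 addresses):
#   if dig +short cn.dataone.org A | rr_matches rr-initial.txt; then
#       echo "DNS RR matches the initial configuration"
#   else
#       echo "DNS RR differs from the initial configuration"
#   fi
```

The sample dig answer in step 1.3 shows a TTL of 60 seconds, so a resolver may keep returning the old addresses briefly after the change is made.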