Node Management v3.7
Each database that is member of a BDR group must be represented by its own node. A node is an unique identifier of such a database in the BDR group.
At present, each node can be a member of just one node group; this may be extended in later releases. Each node may subscribe to one or more Replication Sets to give fine-grained control over replication.
A BDR Group may also contain zero or more sub-groups, allowing a variety of different architectures to be created.
Creating and Joining a BDR Group
For BDR, every node has to have a connection to every other node. To make
configuration easy, when a new node joins, it automatically configures all
existing nodes to connect to it. For this reason, every node, including
the first BDR node created, must know the PostgreSQL connection string
(sometimes referred to as a DSN, for "data source name") that other nodes
can use to connect to it. Both formats of connection string are supported.
So you can use either key-value format, like host=myhost port=5432 dbname=mydb
,
or URI format: postgresql://myhost:5432/mydb
.
The SQL function bdr.create_node_group()
is used to create the BDR group
from the local node. Doing so activates BDR on that node and allows other
nodes to join the BDR group (which consists of only one node at that point).
At the time of creation, you must specify the connection string that other
nodes will use to connect to this node.
Once the node group is created, every further node can join the BDR
group using the bdr.join_node_group()
function.
Alternatively, use the command line utility bdr_init_physical
to
create a new node, using pg_basebackup
(or a physical standby) of an existing
node. If using pg_basebackup
, the bdr_init_physical
utility can optionally
specify the base backup of the target database only, as opposed to the earlier
behaviour of backup of the entire database cluster. This should make this activity
complete faster, and also allow it to use less space due to the exclusion of
unwanted databases. If only the target database is specified, then the excluded
databases get cleaned up and removed on the new node.
The bdr_init_physical
utility replaces the functionality of the
bdr_init_copy
utility from BDR1 and BDR2. It is the BDR3 equivalent of the
pglogical pglogical_create_subscriber
utility.
When a new BDR node is joined to an existing BDR group or a node is subscribed
to an upstream peer, before replication can begin, the system must copy the
existing data from the peer node(s) to the local node. This copy must be
carefully coordinated so that the local and remote data starts out
identical; it is not sufficient to just use pg_dump
yourself. The BDR
extension provides built-in facilities for making this initial copy.
During the join process, the BDR extension will synchronize existing data using the provided source node as the basis, and creates all metadata information needed for establishing itself in the mesh topology in the BDR group. If the connection between the source and the new node disconnects during this initial copy, the join process will need to be restarted from the beginning.
The node that is joining the cluster must not contain any schema or data
that already exists on databases in the BDR group. We recommend that the
newly joining database is empty except for the BDR and pglogical extension.
Ensure that all required database users and roles are created. However, the
schema synchronization can be optionally skipped using synchronize_structure
parameter of bdr.join_node_group()
function in which case the schema must
exist on the newly joining node already.
We recommend that the source node which has the best connection (i.e. is closest) is selected as the source node for joining, since that lowers the time needed for the join to finish.
The join procedure is coordinated using the Raft consensus algorithm, which requires most existing nodes to be online and reachable.
The logical join procedure (which uses bdr.join_node_group()
function)
performs data sync doing COPY
operations and will use multiple writers
(parallel apply) if those are enabled.
Node join can execute concurrently with other node joins for the majority of the time taken to join. However, only one regular node at a time can be in either of the states PROMOTE or PROMOTING, which are typically fairly short if all other nodes are up and running, otherwise the join will get serialized at this stage. The subscriber-only nodes are an exception to this rule, and they can be concurrently in PROMOTE and PROMOTING states as well, so their join process is fully concurrent.
Note that the join process uses only one node as the source, so can be executed when nodes are down, if a majority of nodes are available. This can cause a complexity when running logical join: During logical join, the commit timestamp of rows copied from the source node will be set to the latest commit timestamp on the source node. Committed changes on nodes that have a commit timestamp earlier than this (because nodes are down or have significant lag) could conflict with changes from other nodes; in this case, the newly joined node could be resolved differently to other nodes, causing a divergence. As a result, we recommend not to run a node join when significant replication lag exists between nodes; but if this is necessary, run LiveCompare on the newly joined node to correct any data divergence once all nodes are available and caught up.
pg_dump
may fail when there is concurrent DDL activity on the source node
because of cache lookup failures. Since bdr.join_node_group() uses pg_dump
internally, it may fail if there is concurrent DDL activity on the source node.
Retrying the join should work in such a case.
Joining a Heterogeneous Cluster
BDR 3.7 node can join a EDB Postgres Distributed cluster running 3.6.x at a specific minimum maintenance release (e.g. 3.6.25) or a mix of 3.6 and 3.7 nodes. This procedure is useful when user wants to upgrade not just the BDR major version but also the underlying PostgreSQL (or 2ndQPostgres) major version. This can be achieved by joining a 3.7 node running on PostgreSQL (or 2ndQPostgres) 12 or 13 to a EDB Postgres Distributed cluster running 3.6.x on PostgreSQL (or 2ndQPostgres) 11. Of course, the new node can also be running on the same PostgreSQL major release as all of the nodes in the existing cluster.
BDR ensures that the replication works correctly in all directions even when some nodes are running 3.6 on one PostgreSQL major release and other nodes are running 3.7 on another PostgreSQL major release. But it's recommended that the user quickly bring the cluster into a homogenous state by parting the older nodes once enough new nodes has joined the cluster. Care must be taken to not run any DDLs that might not be available on the older versions and vice versa.
A node joining with a different major PostgreSQL release cannot use
physical backup taken via bdr_init_physical
and the node must join
using the logical join method. This is necessary because the major
PostgreSQL releases are not on-disk compatible with each other.
Note that when a 3.7 node joins the cluster using a 3.6 node as a source, certain configuration such as conflict resolution configurations are not copied over from the source node. The node must be configured after it has joined the cluster.
Connection DSNs and SSL (TLS)
The DSN of a node is simply a libpq
connection string, since nodes connect
using libpq
. As such, it can contain any permitted libpq
connection
parameter, including those for SSL. Note that the DSN must work as the
connection string from the client connecting to the node in which it is
specified. An example of such a set of parameters using a client certificate is
shown here:
With this setup, the files bdr_client.crt
, bdr_client.key
and
root.crt
must be present in the data directory on each node, with the
appropriate permissions.
For verify-full
mode, the server's SSL certificate will be checked to
ensure that it is directly or indirectly signed with the root.crt
Certificate
Authority, and that the host name or address used in the connection matches the
contents of the certificate. In the case of a name, this can match a Subject
Alternative Name or, if there are no such names in the certificate, the
Subject's Common Name (CN) field.
Postgres does not currently support Subject Alternative Names for IP
addresses, so if the connection is made by address rather than name, it must
match the CN field.
The CN of the client certificate must be the name of the user making the
BDR connection.
This is usually the user postgres
. Each node will require matching
lines permitting the connection in the pg_hba.conf
file; for example:
Another setup could be to use SCRAM-SHA-256
passwords instead of client
certificates, and not bother about verifying the server identity as long as
the certificate is properly signed. Here the DSN parameters might be just:
...and the corresponding pg_hba.conf
lines would be like this:
In such a scenario, the postgres user would need a .pgpass
file
containing the correct password.
Witness Nodes
If the cluster has an even number of nodes, it may be beneficial to create an extra node to help break ties in the event of a network split (or network partition, as it is sometimes called).
Rather than create an additional full-size node, you can create a micro node, sometimes called a Witness node. This is a normal BDR node that is deliberately set up not to replicate any tables or data to it.
Logical Standby Nodes
BDR allows you to create a "logical standby node", also known as an "offload node", a "read-only node", "receive-only node" or "logical read replicas". A master node can have zero, one or more logical standby nodes.
With a physical standby node, the node never comes up fully, forcing it to
stay in continual recovery mode.
BDR allows something similar. bdr.join_node_group
has the pause_in_standby
option to make the node stay in half-way-joined as a logical standby node.
Logical standby nodes receive changes but do not send changes made locally
to other nodes.
Later, if desired, use bdr.promote_node() to move the logical standby into a full, normal send/receive node.
A logical standby is sent data by one source node, defined by the DSN in
bdr.join_node_group
. Changes from all other nodes are received from this one
source node, minimizing bandwidth between multiple sites.
There are multiple options for high availability:
If the source node dies, one physical standby can be promoted to a master. In this case, the new master can continue to feed any/all logical standby nodes.
If the source node dies, one logical standby can be promoted to a full node and replace the source in a failover operation similar to single master operation. Note that if there are multiple logical standby nodes, the other nodes cannot follow the new master, so the effectiveness of this technique is effectively limited to just one logical standby.
Note that in case a new standby is created of an existing BDR node, the necessary replication slots for operation are not synced to the new standby until at least 16 MB of LSN has elapsed since the group slot was last advanced. In extreme cases, this may require a full 16 MB before slots are synced/created on the streaming replica. If a failover or switchover occurs during this interval, the streaming standby cannot be promoted to replace its BDR node, as the group slot and other dependent slots do not exist yet.
On EDB Postgres Extended and EDB Postgres Advaced, this is resolved automatically. The slot sync-up process on the standby solves this by invoking a function on the upstream. This function moves the group slot in the entire EDB Postgres Distributed cluster by performing WAL switches and requesting all BDR peer nodes to replay their progress updates. The above causes the group slot to move ahead in a short timespan. This reduces the time required by the standby for the initial slot's sync-up, allowing for faster failover to it, if required.
On PostgreSQL, it is important to ensure that slot's sync up has completed on
the standby before promoting it. The following query can be run on the
standby in the target database to monitor and ensure that the slots have
synced up with the upstream. The promotion can go ahead when this query
returns true
.
It is also possible to nudge the slot sync-up process in the entire BDR cluster by manually performing WAL switches and by requesting all BDR peer nodes to replay their progress updates. This activity will cause the group slot to move ahead in a short timespan, and also hasten the slot sync-up activity on the standby. The following queries can be run on any BDR peer node in the target database for this:
Use the monitoring query from above on the standby to check that these queries indeed help in faster slot sync-up on that standby.
Logical standby nodes can themselves be protected using physical standby nodes, if desired, so Master->LogicalStandby->PhysicalStandby. Note that you cannot cascade from LogicalStandby to LogicalStandby.
Note that a logical standby does allow write transactions, so the restrictions of a physical standby do not apply. This can be used to great benefit, since it allows the logical standby to have additional indexes, longer retention periods for data, intermediate work tables, LISTEN/NOTIFY, temp tables, materialized views, and other differences.
Any changes made locally to logical standbys that commit before the promotion will not be sent to other nodes. All transactions that commit after promotion will be sent onwards. If you perform writes to a logical standby, you are advised to take care to quiesce the database before promotion.
You may make DDL changes to logical standby nodes but they will not be replicated, nor will they attempt to take global DDL locks. BDR functions which act similarly to DDL will also not be replicated. See [DDL Replication]. If you have made incompatible DDL changes to a logical standby, then the database is said to be a divergent node. Promotion of a divergent node will currently result in replication failing. As a result, you should plan to either ensure that a logical standby node is kept free of divergent changes if you intend to use it as a standby, or ensure that divergent nodes are never promoted.
Physical Standby Nodes
BDR also enables the creation of traditional physical standby failover nodes. These are commonly intended to directly replace a BDR node within the cluster after a short promotion procedure. As with any standard Postgres cluster, a node may have any number of these physical replicas.
There are, however, some minimal prerequisites for this to work properly due to the use of replication slots and other functional requirements in BDR:
- The connection between BDR Primary and Standby uses streaming replication through a physical replication slot.
- The Standby has:
recovery.conf
(for PostgreSQL <12, for PostgreSQL 12+ these settings should be inpostgres.conf
):primary_conninfo
pointing to the Primaryprimary_slot_name
naming a physical replication slot on the Primary to be used only by this Standby
postgresql.conf
:shared_preload_libraries = 'pglogical, bdr'
at minimumhot_standby = on
hot_standby_feedback = on
- The Primary has:
postgresql.conf
:pglogical.standby_slot_names
should specify the physical replication slot used for the Standby'sprimary_slot_name
.
While this is enough to produce a working physical standby of a BDR node, there are some additional concerns that should be addressed.
Once established, the Standby requires sufficient time and WAL traffic
to trigger an initial copy of the Primary's other BDR-related
replication slots, including the BDR group slot. At minimum, slots on a
Standby are only "live" and will survive a failover if they report
a non-zero confirmed_flush_lsn
as reported by pg_replication_slots
.
As a consequence, physical standby nodes in newly initialized BDR clusters with low amounts of write activity should be checked before assuming a failover will work normally. Failing to take this precaution can result in the Standby having an incomplete subset of required replication slots necessary to function as a BDR node, and thus an aborted failover.
The protection mechanism that ensures physical standby nodes are up to date
and can be promoted (as configured pglogical.standby_slot_names
) affects the
overal replication latency of the BDR Group as the group replication only
happens once the physical standby nodes are up to date.
For these reasons it's generally recommended to use either logical standby nodes or subscribe-only group instead of physical stanby nodes because they both have better operational characteristics in comparison.
When bdr-enterprise extension is installed, you can can manually ensure the group slot is advanced on all nodes (as much as possible), which helps hasten the creaation of BDR-related replication slots on a physical standby using the following SQL syntax:
Upon failover, the Standby must perform one of two actions to replace the Primary:
- Assume control of the same IP address or hostname as the Primary.
- Inform the EDB Postgres Distributed cluster of the change in address by executing the bdr.alter_node_interface function on all other BDR nodes.
Once this is done, the other BDR nodes will re-establish communication with the newly promoted Standby -> Primary node. Since replication slots are only synchronized periodically, this new Primary may reflect a lower LSN than expected by the existing BDR nodes. If this is the case, BDR will fast-forward each lagging slot to the last location used by each BDR node.
Take special note of the pglogical.standby_slot_names
parameter as
well. While this is a pglogical configuration parameter, it is
important to set in a EDB Postgres Distributed cluster where there is a Primary -> Physical
Standby relationship. While pglogical uses this to ensure physical
standby servers always receive WAL traffic before logical replicas,
the BDR use case is different.
BDR maintains a group slot that always reflects the state of the cluster node showing the most lag for any outbound replication. With the addition of a physical replica, BDR must be informed that there is a non-participating node member that will, regardless, affect the state of the group slot.
Since the Standby does not directly communicate with the other BDR
nodes, the standby_slot_names
parameter informs BDR to consider named
slots as necessary constraints on the group slot as well. When set, the
group slot will be held if the Standby shows lag, even if the group
slot would have normally been advanced.
As with any physical replica, this type of standby may also be configured as a synchronous replica. As a reminder, this requires:
- On the Standby:
- Specifying a unique
application_name
inprimary_conninfo
- Specifying a unique
- On the Primary:
- Enabling
synchronous_commit
- Including the Standby
application_name
insynchronous_standby_names
- Enabling
It is possible to mix physical Standby and other BDR nodes in
synchronous_standby_names
. CAMO and Eager All Node Replication use
different synchronization mechanisms and do not work with synchronous
replication. Please make sure synchronous_standby_names
does not
include the CAMO partner (if CAMO is used) or no BDR node at all (if
Eager All Node Replication is used), but only non-BDR nodes, e.g. a
the Physical Standby.
Sub-Groups
A Group may also contain zero or more sub-groups. Each sub-group can be allocated to a specific purpose within the top-level parent group. The node_group_type specifies the type when the sub-group is created.
Subscriber-Only Groups
BDR 3.7.5 and above supports a new kind of node, called
subscriber-only
node. As the name suggests, this type of node only
subscribes to replication changes from other nodes in the cluster, but
no other nodes receive replication changes from subscriber-only
nodes. This is somewhat similar to Logical Standby nodes, but in contrast
to Logical Standby, the subscriber-only
nodes are fully joined node to
the cluster. They can receive replication changes from all other nodes
in the cluster and hence they are not impacted by unavailability or
parting of any one node in the cluster.
Also unlike pglogical
node, a subscriber-only
node is a fully joined
BDR node and hence it receives all replicated DDLs and acts on those. It
also uses Raft to consistently report its status to all nodes in the
cluster. The subscriber-only
node does not have Raft voting rights and
hence neither can become a Raft leader nor participate in the leader
election. Also, while it receives replicated DDLs, it does not
participate in DDL or DML lock acquisition. In other words, a currently
down subscriber-only
node won't stop a DML lock being acquired.
The subscriber-only
node forms the building block for BDR Tree
topology. In this topology, there are a small number of fully active
nodes, replicating changes in all directions, and there are a large
number of subscriber-only
nodes that only receive changes, but never
send any changes to any other node in the cluster. This topology avoids
connection explosion caused due to a large number of nodes, yet provide
extremely large number of leaf
nodes that can be used to consume the
data.
In order to make use of subscriber-only
nodes, the user must first
create a BDR group of type 'subscriber-only'. It should be a subgroup of
the group from which the member nodes will receive the replication
changes. Once the subgroup is created, all nodes that intend to become
subscriber-only
nodes should join the subgroup. More than one
subgroup of 'subscriber-only' type can be created and they can have
different parent groups.
Once a node successfully joins the 'subscriber-only' subgroup, it will
become a subscriber-only
node and start receiving replication changes
for the parent group. Any changes made directly on the subscriber-only
node will not be replicated.
See bdr.create_node_group()
to know how to create a subgroup of a
specific type and belonging to a specific parent group.
Notes
Since a subscriber-only
node doesn't replicate changes to any node in
the cluster, it can't act as a source for syncing replication changes
when a node is parted from the cluster. But if the subscriber-only
node had already received and applied replication changes from the
parted node that no other node in the cluster currently has, then that
will cause inconsistency between the nodes.
For now, this can be solved by setting pglogical.standby_slot_names
and pglogical.standby_slots_min_confirmed
appropriately so that there
is always a fully active BDR node that is ahead of the subscriber-only
nodes. See pglogical documentation to show how to use these
configuration parameters effectively.
This will be improved in a future release. We may either allow
subscriber-only
nodes to be ahead in the replication and then use them
as replication source for sync or simply provide ways to optionally
remove the inconsistent subscriber-only
nodes from the cluster when
another fully joined node is parted. RM20306 tracks the development task.
Decoding Worker
BDR3.7 provides an option to enable decoding worker process that performs decoding once, no matter how many nodes are being sent data. This introduces a new process, the wal decoder, on each BDR node. One WAL Sender process still exists for each connection, but these processes now just perform the task of sending and receiving data. Taken together these changes reduce the CPU overhead of larger BDR groups and also allow higher replication throughput since the WAL Sender process now spends more time on communication.
enable_wal_decoder
is an option for each BDR group, which is currently
disabled by default. bdr.alter_node_group_config()
can be used to enable or
disable the Decoding Worker for a BDR group.
When the Decoding Worker is enabled, BDR stores Logical Change Record
(LCR, in
short) files to allow buffering of changes between decoding and when all
subscribing nodes have received data. LCR files are stored under the
pg_logical
directory within each local node's data directory. The number and
size of the LCR files will vary as replication lag increases, so this will also
need monitoring. The LCRs not required by any of the BDR nodes are cleaned
periodically. The interval between two consecutive cleanups is controlled by
bdr.lcr_cleanup_interval
, which defaults to 3 minutes. The cleanup is
disabled when bdr.lcr_cleanup_interval
is zero.
When disabled, logical decoding is performed by the WAL Sender process for each node subscribing to each node. In this case, no LCR files are written.
Even though the Decoding Worker is enabled for a BDR group, following
GUCs control the production and usage of LCR per node. By default
these are false
. For production and usage of LCRs we need the
Decoding Worker to be enabled for the BDR group and these GUCs to be
set to true
on each of the nodes in BDR group.
pglogical.enable_wal_decoder
- when turnedfalse
, all WAL Senders using LCRs are restarted to use WAL directly. Whentrue
along with the BDR group config, a Decoding Worker process is started to produce LCR and WAL Senders use LCR.bdr.receive_lcr
- whentrue
on the subscribing node, it requests WAL Sender on the publisher node to use LCRs if available.
Notes
As of now, a Decoding Worker decodes changes corresponding to the node where it is running. A Logical standby is sent changes from all the nodes in BDR group through a single source. Hence a WAL sender serving a Logical standby can not use LCRs right now.
A Subscriber-only node receives changes from respective nodes directly and hence a WAL sender serving a Subscriber-only node can use LCRs.
Even though LCRs are produced, the corresponding WALs are still retained similar to the case when Decoding Worker is not enabled. In future, it may be possible to remove WAL corresponding the LCRs, if they are not required otherwise.
For reference, the first 24 characters of an LCR filename are similar to those in a WAL filename. The first 8 characters of the name are all '0' right now. In future they are expected to represent the TimeLineId similar to the first 8 characters of a WAL segment filename. The following sequence of 16 characters of the name is similar to the WAL segment number which is used to track LCR changes against the WAL stream. However, please note that logical changes are reordered according to the commit order of the transactions they below to. Hence their placement in the LCR segments does not match the placement of corresponding WAL in the WAL segments. The set of last 16 characters represents sub-segment number within an LCR segment. Each LCR file corresponds to a sub-segment. LCR files are binary and variable sized. The maximum size of an LCR file can be controlled by pglogical.max_lcr_segment_file_size, which defaults to 1GB.
EDB Postgres Extended 13 and above is required for this feature to work.
Node Restart and Down Node Recovery
BDR is designed to recover from node restart or node disconnection. The disconnected node will automatically rejoin the group by reconnecting to each peer node and then replicating any missing data from that node.
When a node starts up, each connection will begin showing
bdr.node_slots.state
= catchup
and begin replicating missing data.
Catching-up will continue for a period of time that depends upon the
amount of missing data from each peer node, which will likely increase
over time, depending upon the server workload.
If the amount of write activity on each node is not uniform, the catchup period
from nodes with more data could take significantly longer than other nodes.
Eventually, the slot state will change to bdr.node_slots.state
= streaming
.
Nodes that are offline for longer periods of time, such as hours or days, can begin to cause resource issues for various reasons. Users should not plan on extended outages without understanding the following issues.
Each node retains change information (using one
replication slot
for each peer node) so it can later replay changes to a temporarily unreachable node.
If a peer node remains offline indefinitely, this accumulated change information
will eventually cause the node to run out of storage space for PostgreSQL
transaction logs (WAL in pg_wal
), and will likely cause the database server
to shut down with an error similar to this:
...or report other out-of-disk related symptoms.
In addition, slots for offline nodes also hold back the catalog xmin, preventing vacuuming of catalog tables.
On EDB Postgres Extended, offline nodes also hold back freezing of data to prevent losing conflict resolution data (see: Origin Conflict Detection).
Administrators should monitor for node outages (see: monitoring) and make sure nodes have sufficient free disk space. If the workload is predictable, it may be possible to calculate how much space is used over time, allowing a prediction of the maximum time a node can be down before critical issues arise.
Replication slots created by BDR must not be removed manually. Should that happen, the cluster is damaged and the node that was using the slot must be parted from the cluster, as described below.
Note that while a node is offline, the other nodes may not yet have received the same set of data from the offline node, so this may appear as a slight divergence across nodes. This imbalance across nodes is corrected automatically during the parting process. Later versions may do this at an earlier time.
Replication Slots created by BDR
On a BDR master node, the following replication slots are created automatically:
- One group slot, named
bdr_<database name>_<group name>
; - N-1 node slots, named
bdr_<database name>_<group name>_<node name>
, where N is the total number of BDR nodes in the cluster, including direct logical standbys, if any.
The user must not drop those slots: they have been created automatically by BDR, and will be managed by BDR, which will drop them when/if necessary.
On the other hand, replication slots required by software like Barman
or pglogical can be created or dropped, using the appropriate commands
for the software, without any effect on BDR.
Ensure that slot names used by other software do not begin with the
prefix bdr_
.
For example, in a cluster composed by 3 nodes alpha
, beta
and
gamma
, where BDR is used to replicate the mydb
database, and the
BDR group is called mygroup
:
- Node
alpha
has three slots:- One group slot named
bdr_mydb_mygroup
- Two node slots named
bdr_mydb_mygroup_beta
andbdr_mydb_mygroup_gamma
- One group slot named
- Node
beta
has three slots:- One group slot named
bdr_mydb_mygroup
- Two node slots named
bdr_mydb_mygroup_alpha
andbdr_mydb_mygroup_gamma
- One group slot named
- Node
gamma
has three slots:- One group slot named
bdr_mydb_mygroup
- Two node slots named
bdr_mydb_mygroup_alpha
andbdr_mydb_mygroup_beta
- One group slot named
Group Replication Slot
The group slot is an internal slot used by BDR primarily to track what's the oldest safe position that any node in the BDR group (including all logical standbys) has caught up to, for any outbound replication from this node.
The group slot name is given by the function bdr.local_group_slot_name()
.
The group slot can:
- join new nodes to the BDR group without having all existing nodes up and running (although the majority of nodes should be up), without incurring data loss in case the node which was down during join starts replicating again
- part nodes from cluster consistently, even if some nodes have not caught up fully with the parted node
- hold back the freeze point to avoid missing some conflicts (EDB Postgres Extended)
- keep the historical snapshot for timestamp based snapshots (EDB Postgres Extended)
The group slot is usually inactive, and is only fast-forwarded periodically in response to Raft progress messages from other nodes.
WARNING: Do not drop the group slot. Although usually inactive, it is still vital to the proper operation of the EDB Postgres Distributed cluster. If it is dropped, then some or all of the above features will stop working and/or may have incorrect outcomes.
Hashing Long Identifiers
Note that the name of a replication slot - like any other PostgreSQL identifier - cannot be longer than 63 bytes; BDR handles this by shortening the database name, the BDR group name and the name of the node, in case the resulting slot name is too long for that limit. The shortening of an identifier is carried out by replacing the final section of the string with a hash of the string itself.
As an example of this, consider a cluster that replicates a database
named db20xxxxxxxxxxxxxxxx
(20 bytes long) using a BDR group named
group20xxxxxxxxxxxxx
(20 bytes long); the logical replication slot
associated to node a30xxxxxxxxxxxxxxxxxxxxxxxxxxx
(30 bytes long)
will be called:
...since 3597186
, be9cbd0
and 7f304a2
are respectively the hashes
of db20xxxxxxxxxxxxxxxx
, group20xxxxxxxxxxxxx
and
a30xxxxxxxxxxxxxxxxxxxxxxxxxx
.
Removing a Node From a BDR Group
Since BDR is designed to recover from extended node outages, you must explicitly tell the system if you are removing a node permanently. If you permanently shut down a node and do not tell the other nodes, then performance will suffer, and eventually the whole system will stop working.
Node removal, also called parting, is done using the bdr.part_node()
function. You must specify the node name (as passed during node creation)
to remove a node. The bdr.part_node()
function can be called from any active
node in the BDR group, including the node that is being removed.
Just like the join procedure, parting is done using Raft consensus and requires a majority of nodes to be online to work.
The parting process affects all nodes. The Raft leader will manage a vote between nodes to see which node has the most recent data from the parting node. Then all remaining nodes will make a secondary, temporary, connection to the most-recent node to allow them to catch up any missing data.
A parted node still is known to BDR, but won't consume resources. A
node may well be re-added under the very same name as a parted node.
In rare cases, it may be advisable to clear all metadata of a parted
node with the function bdr.drop_node()
.
Uninstalling BDR
Dropping the BDR extension will remove all the BDR objects in a node, including metadata tables. This can be done with the following command:
If the database depends on some BDR-specific objects, then the BDR extension cannot be dropped. Examples include:
- Tables using BDR-specific sequences such as timeshard or galloc
- Column using CRDT data types
- Views that depend on some BDR catalog tables
Those dependencies must be removed before dropping the BDR extension,
for instance by dropping the dependent objects, altering the column
type to a non-BDR equivalent, or changing the sequence type back to
local
.
Warning
Dropping the BDR extension must only be performed if the node has been successfully parted from its BDR node group, or if it is the last node in the group: dropping BDR and pglogical metadata will break replication to/from the other nodes.
Warning
When dropping a local BDR node, or the BDR extension in the local
database, any preexisting session might still try to execute a BDR
specific workflow, and therefore fail. The problem can be solved
by disconnecting the session and then reconnecting the client, or
by restarting the instance.
Moreover, the "could not open relation with OID (...)" error could
occur when
(1) parting a node from a EDB Postgres Distributed cluster, then
(2) dropping the BDR extension
(3) recreating it, and finally
(4) running pglogical.replication_set_add_all_tables()
.
Restarting the instance will solve the problem.
Similar considerations apply to the pglogical extension, which is required by BDR.
If pglogical is only used by BDR, then it is possible to drop both extensions with a single statement:
Conversely, if the node is also using pglogical independently of BDR, e.g. for one-way replication of some tables to a remote database, then only the BDR extension should be dropped.
Warning
Dropping BDR from a database that independently uses pglogical can block an existing pglogical subscription from working further with the "BDR global lock manager not initialized yet" error. Restarting the instance will solve the problem.
There is also a bdr.drop_node()
function, but this is used only in
emergencies, should there be a problem with parting.
Listing BDR Topology
Listing BDR Groups
The following (simple) query lists all the BDR node groups of which the current node is a member (will currently return one row only):
The configuration of each node group can be displayed using a more complex query:
Listing Nodes in a BDR Group
The list of all nodes in a given node group (e.g. mygroup
) can be
extracted from the bdr.node_summary
view as shown in the following
example:
Note that the read-only state of a node, as shown in the
current_state
or in the target_state
query columns, is indicated
as STANDBY
.
List of Node States
NONE
: Node state is unset when the worker starts, expected to be set quickly to the current known state.CREATED
:bdr.create_node()
has been executed, but the node is not a member of any EDB Postgres Distributed cluster yet.JOIN_START
:bdr.join_node_group()
begins to join the local node to an existing EDB Postgres Distributed cluster.JOINING
: The node join has started and is currently at the initial sync phase, creating the schema and data on the node.CATCHUP
: Initial sync phase is completed; now the join is at the last step of retrieving and applying transactions that were performed on the upstream peer node since the join started.STANDBY
: Node join has finished, but not yet started to broadcast changes. All joins spend some time in this state, but if defined as a Logical Standby, the node will continue in this state.PROMOTE
: Node was a logical standby and we just called bdr.promote_node to move the node state toACTIVE
. These twoPROMOTE
states have to be coherent to the fact, that only one node can be with a state higher thanSTANDBY
but lower thanACTIVE
.PROMOTING
: Promotion from logical standby to full BDR node is in progress.ACTIVE
: The node is a full BDR node and is currentlyACTIVE
. This is the most common node status.PART_START
: Node wasACTIVE
orSTANDBY
and we just called bdr.part_node to remove the node from the EDB Postgres Distributed cluster.PARTING
: Node disconnects from other nodes and plays no further part in consensus or replication.PART_CATCHUP
: Non-parting nodes synchronize any missing data from the recently parted node.PARTED
: Node parting operation is now complete on all nodes.
Only one node at a time can be in either of the states PROMOTE or PROMOTING.
Node Management Interfaces
Nodes can be added and removed dynamically using the SQL interfaces.
bdr.create_node
This function creates a node.
Synopsis
Parameters
node_name
- name of the new node; only one node is allowed per database. Valid node names consist of lower case letters, numbers, hyphens and underscores.local_dsn
- connection string to the node
Notes
This function just creates a record for the local node with the associated public connection string. There can be only one local record, so once it's created, the function will error if run again.
This function is a transactional function - it can be rolled back and the changes made by it are visible to the current transaction.
The function will hold lock on the newly created bdr node until the end of the transaction.
bdr.drop_node
Drops a node. This function is not intended for regular use and should only be executed under the instructions of Technical Support.
This function removes the metadata for a given node from the local database. The node can be either:
- The local node, in which case all the node metadata is removed, including information about remote nodes;
- A remote node, in which case only metadata for that specific node is removed.
Synopsis
Parameters
node_name
- Name of an existing node.cascade
- Whether to cascade to dependent objects, this will also delete the associated pglogical node. This option should be used with caution!force
- Circumvents all sanity checks and forces the removal of all metadata for the given BDR node despite a possible danger of causing inconsistencies. A forced node drop is to be used by Technical Support only in case of emergencies related to parting.
Notes
Before you run this, you should already have parted the node using bdr.part_node()
.
This function removes metadata for a given node from the local database. The node can be either the local node, in which case all the node metadata are removed, including info about remote nodes are removed; or it can be the remote node, in which case only metadata for that specific node is removed.
Note
BDR3 can have a maximum of 1024 node records (both ACTIVE and PARTED) at one time. This is because each node has a unique sequence number assigned to it, for use by timeshard sequences. PARTED nodes are not automatically cleaned up at the moment; should this become a problem, this function can be used to remove those records.
bdr.create_node_group
This function creates a BDR group with the local node as the only member of the group.
Synopsis
Parameters
node_group_name
- Name of the new BDR group; as with the node name, valid group names must consist of lower case letters, numbers and underscores, exclusively.parent_group_name
- The name of the parent group for the subgroup.join_node_group
- This helps a node to decide whether or not to join the group being created by it. The default value is true. This is used when a node is creating a shard group that it does not want to join. This can be false only if parent_group_name is specified.node_group_type
- The valid values are NULL, 'subscriber-only', 'datanode', 'read coordinator' and 'write coordinator'. 'subscriber-only' type is used to create a group of nodes that only receive changes from the fully joined nodes in the cluster, but they never send replication changes to other nodes. See [Subscriber-Only Nodes] for more details. Datanode implies that the group represents a shard, whereas the other values imply that the group represents respective coordinators. Except 'subscriber-only', the rest three values are reserved for future use. NULL implies a normal general purpose node group will be created.
Notes
This function will pass request to local consensus worker that is running for the local node.
The function is not transactional. The creation of the group is a background
process, so once the function has finished, the changes cannot be rolled back.
Also, the changes might not be immediately visible to the current transaction;
the bdr.wait_for_join_completion
can be called to wait until they are.
The group creation does not hold any locks.
bdr.alter_node_group_config
This function changes the configuration parameter(s) of an existing BDR group. Options with NULL value (default for all of them) will not be modified.
Synopsis
Parameters
node_group_name
- Name of an existing BDR group; the local node must be part of the group.insert_to_update
- Reserved for backwards compatibility reasons.update_to_insert
- Reserved for backwards compatibility reasons. versions of BDR. Usebdr.alter_node_set_conflict_resolver
instead.**ignore_redundant_updates
- Reserved for backwards compatibility reasons.check_full_tuple
- Reserved for backwards compatibility reasons.apply_delay
- Reserved for backwards compatibility reasonscheck_constraints
- Whether the apply process will check the constraints when writing replicated data. This option is deprecated and may be disabled or removed in future versions of BDR.num_writers
- number of parallel writers for subscription backing this node group, -1 means the default (as specified by the pglogical GUC pglogical.writers_per_subscription) will be used. Valid values are either -1 or a positive integer.enable_wal_decoder
- Enables/disables the Decoding Worker process.
Note that all of the options parameters are simply used to control the pglogical writer.
Notes
This function will pass a request to the group consensus mechanism to change the defaults. The changes made are replicated globally via the consensus mechanism.
The function is not transactional. The request is processed in the background so the function call cannot be rolled back. Also, the changes may not be immediately visible to the current transaction.
This function does not hold any locks.
Warning
When this function is used to change the apply_delay
value, the
change does not apply to nodes that are already members of the
group.
Note that this restriction has little consequence on production
usage, because this value is normally not used outside of testing.
bdr.join_node_group
This function joins the local node to an already existing BDR group.
Synopsis
Parameters
join_target_dsn
- Specifies the connection string to existing (source) node in the BDR group we wish to add the local node to.node_group_name
- Optional name of the BDR group; defaults to NULL which tries to autodetect the group name from information present on the source node.pause_in_standby
- Optionally tells the join process to only join as a logical standby node, which can be later promoted to a full member.wait_for_completion
- Wait for the join process to complete before returning; defaults to true.synchronize_structure
- Set what kind of structure (schema) synchronization should be done during the join. Valid options are 'all' which synchronizes the complete database structure, and 'none' which will not synchronize any structure (however, it will still synchronize data).
If wait_for_completion
is specified as false;
this is an asynchronous call which returns as soon as the joining procedure
has started. Progress of the join can be seen in logs and the
bdr.state_journal_details
information view, or by calling the
bdr.wait_for_join_completion()
function once bdr.join_node_group()
returns.
Notes
This function will pass a request to the group consensus mechanism via the node
that the join_target_dsn
connection string points to.
The changes made are replicated globally via the consensus mechanism.
The function is not transactional. The joining process happens in the
background and as such cannot be rolled back. The changes are only visible
to the local transaction if wait_for_completion
was set to true
or by calling
bdr.wait_for_join_completion
later.
Node can only be part of a single group, so this function can only be called once on each node.
Node join does not hold any locks in the BDR group.
bdr.promote_node
This function promotes a local logical standby node to a full member of the BDR group.
Synopsis
Notes
This function will pass a request to the group consensus mechanism to change the defaults. The changes made are replicated globally via the consensus mechanism.
The function is not transactional. The promotion process happens in the
background, and as such cannot be rolled back. The changes are only visible
to the local transaction if wait_for_completion
was set to true
or by calling
bdr.wait_for_join_completion
later.
The promotion process holds lock against other promotions. This lock will not
block other bdr.promote_node
calls, but will prevent the background process of
promotion from moving forward on more than one node at a time.
bdr.wait_for_join_completion
This function waits for the join procedure of a local node to finish.
Synopsis
Parameters
verbose_progress
- Optionally prints information about individual steps taken during the join procedure.
Notes
This function waits until the checks state of the local node reaches the target
state, which was set by bdr.create_node_group
, bdr.join_node_group
or
bdr.promote_node
.
bdr.part_node
Removes ("parts") the node from the BDR group, but does not remove data from the node.
The function can be called from any active node in the BDR group, including the node which is being removed. However, just to make it clear, once the node is PARTED it can not part other nodes in the cluster.
Note
If you are parting the local node you must set wait_for_completion
to false, otherwise it will error.
Warning
This action is permanent. If you wish to temporarily halt replication
to a node, see bdr.alter_subscription_disable()
.
Synopsis
Parameters
node_name
- Name of an existing node to part.wait_for_completion
- If true, the function will not return until the node is fully parted from the cluster, otherwise the function will just start the parting procedure and return immediately without waiting. Always set to false when executing on the local node, or when using force.force
- Forces removal of the node on the local node. This will set the node state locally if consensus could not be reached or if the node parting process has stuck.
Warning
Using force = true
may leave the BDR group in a inconsistent
state and should be only used to recover from byzantine failures where it's
impossible to remove the node any other way.**
Notes
This function will pass a request to the group consensus mechanism to part
the given node. The changes made are replicated globally via the consensus
mechanism. The parting process happens in the background, and as such cannot
be rolled back. The changes made by the parting process are only visible to
the local transaction if wait_for_completion
was set to true
.
With force
set to true
, on consensus failure, this function will set the
state of the given node only on the local node. In such a case, the function is
transactional (because the function itself changes the node state) and can be
rolled back. If the function is called on a node which is already in process of
parting with force
set to true
, it will also just mark the given node as
parted locally and exit. This is only useful when the consensus cannot be
reached on the cluster (i.e. the majority of the nodes are down) or if the
parting process gets stuck for whatever reason. But it is important to take into
account that when the parting node that was receiving writes, the parting process
may take a long time without being stuck, as the other nodes need to resynchronize
any missing data from the given node. The force parting completely skips this
resynchronization, and as such can leave the other nodes in inconsistent state.
The parting process does not hold any locks.
bdr.alter_node_interface
This function changes the connection string (DSN
) of a specified node.
Synopsis
Parameters
node_name
- name of an existing node to alterinterface_dsn
- new connection string for a node
Notes
This function is only run on the local node and the changes are only made on the local node. This means that it should normally be executed on every node in the BDR group, including the node which is being changed.
This function is transactional - it can be rolled back, and the changes are visible to the current transaction.
The function holds lock on the local node.
bdr.alter_subscription_enable
This function enables either the specified subscription or all the subscriptions of the local BDR node. Also known as resume subscription. No error is thrown if the subscription is already enabled. Returns the number of subscriptions affected by this operation.
Synopsis
Parameters
subscription_name
- Name of the subscription to enable; if NULL (the default), all subscriptions on the local node will be enabled.immediate
- This currently has no effect.
Notes
This function is not replicated and only affects local node subscriptions (either a specific node or all nodes).
This function is transactional - it can be rolled back and any catalog changes can be seen by the current transaction. The subscription workers will be started by a background process after the transaction has committed.
bdr.alter_subscription_disable
This function disables either the specified subscription or all the subscriptions of the local BDR node. Optionally, it can also immediately stop all the workers associated with the disabled subscriptions. Also known as pause subscription. No error is thrown if the subscription is already disabled. Returns the number of subscriptions affected by this operation.
Synopsis
Parameters
subscription_name
- Name of the subscription to disable; if NULL (the default), all subscriptions on the local node will be disabled.immediate
- Immediate is used to force the action immediately, stopping all the workers associated with the disabled subscription. With this option true, this function cannot be run inside of the transaction block.
Notes
This function is not replicated and only affects local node subscriptions (either a specific subscription or all subscriptions).
This function is transactional - it can be rolled back and any catalog changes
can be seen by the current transaction. However, the timing of the subscription
worker stopping depends on the value of immediate
; if set to true
, the
workers will be stopped immediately; if set to false
, they will be stopped at
the COMMIT
time.
Note
With the parameter immediate
set to true
, the stop will however wait
for the workers to finish current work.
Node Management Commands
BDR also provides a command line utility for adding nodes to the BDR group via
physical copy (pg_basebackup
) of an existing node, and for converting a
physical standby of an existing node to a new node in the BDR group.
bdr_init_physical
This is a regular command which is added to PostgreSQL's bin directory.
The user must specify a data directory. If this data directory is empty,
the pg_basebackup -X stream
command is used to fill the directory
using a fast block-level copy operation.
When starting from an empty data directory, if the selective backup option is chosen, then only that database will be copied from the source node. The excluded databases will be dropped and cleaned up on the new node (EDB Postgres Extended).
If the specified data directory is non-empty, this will be used as the
base for the new node. If the data directory is already active as a
physical standby node, it is required to stop the standby before running
bdr_init_physical
, which will manage Postgres itself. Initially it will
wait for catchup and then promote to a master node before joining the BDR
group. Note that the --standby
option, if used, will turn the existing
physical standby into a logical standby node; it refers to the end state
of the new BDR node, not the starting state of the specified data directory.
This command will drop all pglogical-only subscriptions and configuration from
the database and will also drop all PostgreSQL native logical replication
subscriptions from the database (or just disable them when the -S
option is
used), as well as any replication origins and slots.
It is the BDR3 version of the pglogical_create_subscriber
utility.
Synopsis
Options
General Options
-D, --pgdata=DIRECTORY
- The data directory to be used for the new node; it can be either empty/non-existing directory, or a directory populated using thepg_basebackup -X stream
command (required).-l, --log-file=FILE
- Use FILE for logging; default is bdr_init_physical_postgres.log .-n, --node-name=NAME
- The name of the newly created node (required).--replication-sets=SETS
- The name of a comma-separated list of replication set names to use; all replication sets will be used if not specified.--standby
- Create a logical standby (receive only node) rather than full send/receive node.--node-group-name
- Group to join, defaults to the same group as source node.-s, --stop
- Stop the server once the initialization is done.-v
- Increase logging verbosity.-L
- Perform selective pg_basebackup when used in conjunction with an empty/non-existing data directory (-D option). (EDB Postgres Extended)-S
- Instead of dropping logical replication subscriptions, just disable them.
Connection Options
-d, --remote-dsn=CONNSTR
- connection string for remote node (required)--local-dsn=CONNSTR
- connection string for local node (required)
Configuration Files Override
--hba-conf -path
to the new pg_hba.conf--postgresql-conf
- path to the new postgresql.conf--postgresql-auto-conf
- path to the new postgresql.auto.conf
Notes
The replication set names specified in the command do not affect the data that
exists in the data directory before the node joins the BDR group. This is true
whether bdr_init_physical makes its own basebackup or an existing base backup
is being promoted to a new BDR node. Thus the --replication-sets
option only
affects the data published and subscribed-to after the node joins the BDR node
group. This behaviour is different from the way replication sets are used in a
logical join i.e. when using bdr.join_node_group()
.
Unwanted tables may be truncated by the operator after the join has completed.
Refer to the bdr.tables
catalog to determine replication set membership and
identify tables that are not members of any subscribed-to replication set. It's
strongly recommended that you truncate the tables rather than drop them, because:
- DDL replication sets are not necessarily the same as row (DML) replication sets, so you could inadvertently drop the table on other nodes;
- If you later want to add the table to a replication set and you have dropped it on some subset of nodes, you will need to take care to re-create it only on those nodes without creating DDL conflicts before you can add it to any replication sets.
It's much simpler and safer to truncate your non-replicated tables, leaving them present but empty.
A future version of BDR may automatically omit or remove tables that are not part of the selected replication set(s) for a physical join, so your application should not rely on details of the behaviour documented here.