public class RaftKVDatabase extends Object implements KVDatabase
KVDatabase based on the Raft consensus algorithm.
Raft defines a distributed consensus algorithm for maintaining a shared state machine. Each Raft node maintains a complete copy of the state machine. Cluster nodes elect a leader who collects and distributes updates and provides for consistent reads. As long as a node is part of a majority, the state machine is fully operational.
RaftKVDatabase turns this into a transactional key/value database with linearizable ACID semantics.
Implementation Details
SnapshotKVDatabase):
MutableView to collect mutations.
The MutableView is based on the local node's most recent log entry (whether committed or not); this is called the base term and index for the transaction.
Reads, Writes, base index and term, and any config change are sent to the leader.
RetryTransactionException.
RetryTransactionException.
Writes associated with log entries after the transaction's base log entry do not create conflicts when compared against the transaction's Reads. If so, the transaction is rejected with a RetryTransactionException.
Writes (and any config change) to its log. The associated term and index become the transaction's commit term and index; the leader then replies to the follower with this information.
AppendRequests to a majority of followers who have since responded, plus the minimum election timeout, minus a small adjustment for possible clock drift (this assumes all nodes have the same minimum election timeout configured). If the current time is prior to the leader lease timeout, the transaction may be committed as soon as the log entry corresponding to the commit term and index is committed (it may already be); otherwise, the current time is returned to the follower as the minimum required leader lease timeout before the transaction may be committed.
AppendRequest includes the leader's current timestamp and leader lease timeout, so followers can commit any waiting read-only transactions. Leaders keep track of which followers are waiting on which leader lease timeout values, and when the leader lease timeout advances to allow a follower to commit a transaction, the follower is immediately notified.
RaftKVTransaction.setConsistency().
Limitations
AtomicKVStore is required to store local persistent state.
In general, the algorithm should function correctly under all non-Byzantine conditions. The level of difficulty the system is experiencing, due to contention, network errors, etc., can be measured in terms of:
RetryTransactionExceptions
Cluster Configuration
Instances support dynamic cluster configuration changes at runtime.
Initially, all nodes are in an unconfigured state, where nothing has been added to the Raft log yet and no cluster is defined. Unconfigured nodes are passive: they stay in follower mode (i.e., they will not start elections), and they disallow local transactions that make any changes other than as described below to create a new cluster.
An unconfigured node becomes configured when either:
1. RaftKVTransaction.configChange() is invoked and committed within a local transaction, which creates a new single node cluster and commits the first log entry; or
2. An AppendRequest is received from a leader of some existing cluster, in which case the node records the cluster ID thereby joining the cluster (see below), and applies the received cluster configuration.
A node is configured if and only if it has recorded one or more log entries. The very first log entry always contains the initial cluster configuration (containing only the node that created it, whether local or remote), so any node that has a non-empty log is configured.
Newly created clusters are assigned a random 32-bit cluster ID (option #1 above). This ID is included in all messages sent over the network, and adopted by unconfigured nodes that join the cluster (via option #2 above). Configured nodes discard incoming messages containing a cluster ID different from the one they have joined. This prevents data corruption that can occur if nodes from two different clusters are inadvertently "mixed" together.
Once a node joins a cluster with a specific cluster ID, it cannot be reassigned to a different cluster without first returning it to the unconfigured state; to do that, the node must be shut down and its persistent state deleted.
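The cluster ID rules above can be modeled in a few lines. The following is a minimal sketch, not the RaftKVDatabase implementation: the class and method names are hypothetical, and it assumes that a cluster ID of zero denotes an unconfigured node (as the getClusterId() description states) and that senders always carry a non-zero ID.

```java
/** Toy model of cluster ID handling; all names here are hypothetical. */
class ClusterIdFilterSketch {

    /** Zero means "unconfigured", following the getClusterId() description. */
    private int clusterId;

    /**
     * Decide whether to accept an AppendRequest carrying the given cluster ID.
     * An unconfigured node adopts the sender's ID; a configured node
     * discards requests that carry any other cluster's ID.
     */
    boolean acceptAppendRequest(int messageClusterId) {
        if (messageClusterId == 0)
            return false;                       // assumption: senders are always configured
        if (this.clusterId == 0) {
            this.clusterId = messageClusterId;  // join the sender's cluster
            return true;
        }
        return this.clusterId == messageClusterId;
    }

    int getClusterId() {
        return this.clusterId;
    }
}
```

Once the first request has been accepted, requests carrying a different ID are rejected until the node's state is reset, which mirrors the rule that a node cannot move between clusters without returning to the unconfigured state.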
Configuration Changes
Once a node is configured, a separate issue is whether the node is included in its own configuration, i.e., whether
the node is a member of its cluster according to the current cluster configuration. A node that is not a member of its
cluster does not count its own vote to determine committed log entries (if a leader), and does not start elections
(if a follower). However, it will accept and respond to incoming AppendRequests and RequestVotes.
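To illustrate why a leader that is not a member of its own cluster does not count its own vote, here is a simplified sketch of a commit index calculation. All names are hypothetical; it only models the majority rule and omits other Raft conditions (such as restricting commitment to entries from the leader's current term).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of majority-based commit index calculation; names are hypothetical. */
class CommitIndexSketch {

    /**
     * Highest log index known to be stored on a majority of cluster members.
     *
     * @param members identities in the current cluster configuration
     * @param leader identity of the leader doing the calculation
     * @param leaderLastIndex index of the leader's own last log entry
     * @param followerMatchIndex highest index each follower has acknowledged
     */
    static long commitIndex(Set<String> members, String leader,
      long leaderLastIndex, Map<String, Long> followerMatchIndex) {
        List<Long> votes = new ArrayList<>();
        for (String member : members) {
            if (member.equals(leader))
                votes.add(leaderLastIndex);     // counted only when the leader is a member
            else
                votes.add(followerMatchIndex.getOrDefault(member, 0L));
        }
        if (votes.isEmpty())
            return 0;
        Collections.sort(votes, Collections.reverseOrder());
        int majority = votes.size() / 2 + 1;
        return votes.get(majority - 1);         // the majority-th highest index
    }
}
```

With members A, B, C, leader A at index 10, and followers B and C at 7 and 4, the result is 7. If A has been removed from the configuration (members B, C only), A's own index is ignored and the result drops to 4.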
In addition, leaders follow these rules with respect to configuration changes:
AppendRequests.
AppendRequests until the follower acknowledges receipt of the log entry containing the configuration change.
Follower Probes
This implementation includes a modification to the Raft state machine to avoid unnecessary, disruptive elections when one or more nodes are disconnected from, and then reconnected to, the majority.
When a follower's election timeout fires, before converting into a candidate, the follower is required to verify
communication with a majority of the cluster using PingRequest messages. Only when the follower has
successfully done so may it become a candidate. While in this intermediate "probing" mode, the follower responds
normally to incoming messages. In particular, if the follower receives a valid AppendRequest from the leader, it
reverts back to normal operation.
This behavior is optional, but enabled by default (see setFollowerProbingEnabled()).
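The probing rule can be sketched as a simple majority check over ping responses. This is an illustration only: the class and method names are hypothetical, and it assumes the probing node counts itself toward the majority when it is a cluster member, which the text above does not state explicitly.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of follower probing; all names here are hypothetical. */
class FollowerProbeSketch {

    private final String self;
    private final Set<String> members;
    private final Set<String> responded = new HashSet<>();

    FollowerProbeSketch(String self, Set<String> members) {
        this.self = self;
        this.members = new HashSet<>(members);
    }

    /** Record a ping response received from a peer while probing. */
    void pingResponseReceived(String peer) {
        // Ignore non-members and ourselves; duplicates collapse in the set
        if (this.members.contains(peer) && !peer.equals(this.self))
            this.responded.add(peer);
    }

    /** True once this node plus the responding peers form a majority. */
    boolean mayBecomeCandidate() {
        // Assumption: a member node counts itself as reachable
        int reachable = this.responded.size() + (this.members.contains(this.self) ? 1 : 0);
        return reachable > this.members.size() / 2;
    }
}
```

In a five-node cluster, the probing follower needs responses from two distinct peers before it may become a candidate; a node cut off from the majority never gets there and so never disrupts the cluster with an election.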
Key Watches
Key watches and mutable snapshots are supported.
Spring Isolation Levels
In Spring applications, the transaction Consistency level may be configured through the Spring
org.jsimpledb.spring.JSimpleDBTransactionManager by (ab)using the transaction isolation level setting,
for example, via the @Transactional annotation's
isolation() property.
All Raft consistency levels are made available this way, though the mapping from Spring's isolation levels to
RaftKVDatabase's consistency levels is only semantically approximate:
| Spring isolation level | RaftKVDatabase consistency level |
|---|---|
| Isolation | Consistency.LINEARIZABLE |
| Isolation | Consistency.LINEARIZABLE |
| Isolation | Consistency.EVENTUAL |
| Isolation | Consistency.EVENTUAL_COMMITTED |
| Isolation | Consistency.UNCOMMITTED |
| Modifier and Type | Field and Description |
|---|---|
| static int | DEFAULT_COMMIT_TIMEOUT - Default transaction commit timeout (5000ms). |
| static int | DEFAULT_HEARTBEAT_TIMEOUT - Default heartbeat timeout (200ms). |
| static int | DEFAULT_MAX_ELECTION_TIMEOUT - Default maximum election timeout (1000ms). |
| static int | DEFAULT_MAX_TRANSACTION_DURATION - Default maximum supported outstanding transaction duration (5000ms). |
| static long | DEFAULT_MAX_UNAPPLIED_LOG_MEMORY - Default maximum supported applied log entry memory usage (104857600L bytes). |
| static int | DEFAULT_MIN_ELECTION_TIMEOUT - Default minimum election timeout (750ms). |
| static int | DEFAULT_TCP_PORT - Default TCP port (9660) used to communicate with peers. |
| static String | OPTION_CONSISTENCY - Option key for createTransaction(Map). |
| Constructor and Description |
|---|
| RaftKVDatabase() |
| Modifier and Type | Method and Description |
|---|---|
| RaftKVTransaction | createTransaction() - Create a new transaction. |
| RaftKVTransaction | createTransaction(Consistency consistency) - Create a new transaction with the specified consistency. |
| RaftKVTransaction | createTransaction(Map<String,?> options) |
| int | getClusterId() - Retrieve the unique 32-bit ID for this node's cluster. |
| long | getCommitIndex() - Get this instance's current commit index. |
| int | getCommitTimeout() - Get the configured default transaction commit timeout. |
| Map<String,String> | getCurrentConfig() - Retrieve the current cluster configuration as understood by this node. |
| Role | getCurrentRole() - Get this instance's current role: leader, follower, or candidate. |
| long | getCurrentTerm() - Get this instance's current term. |
| long | getCurrentTermStartTime() - Get the time at which this instance's current term advanced to its current value. |
| int | getHeartbeatTimeout() - Get the configured heartbeat timeout. |
| String | getIdentity() - Get this node's Raft identity. |
| long | getLastAppliedIndex() - Get this instance's last applied log entry index. |
| long | getLastAppliedTerm() - Get this instance's last applied log entry term. |
| File | getLogDirectory() - Get the directory in which uncommitted log entries are stored. |
| int | getMaxElectionTimeout() - Get the configured maximum election timeout. |
| int | getMaxTransactionDuration() - Get the configured maximum supported duration for outstanding transactions. |
| long | getMaxUnappliedLogMemory() - Get the configured maximum allowed memory used for unapplied log entries. |
| int | getMinElectionTimeout() - Get the configured minimum election timeout. |
| List<RaftKVTransaction> | getOpenTransactions() - Get the set of open transactions associated with this database. |
| List<LogEntry> | getUnappliedLog() - Get the unapplied LogEntrys in this instance's Raft log. |
| long | getUnappliedLogMemoryUsage() - Get the estimated total memory used by unapplied log entries. |
| boolean | isClusterMember() - Determine whether this node thinks that it is part of its cluster, as determined by its current configuration. |
| boolean | isClusterMember(String node) - Determine whether this node thinks that the specified node is part of the cluster, as determined by its current configuration. |
| boolean | isConfigured() - Determine whether this instance is configured. |
| boolean | isFollowerProbingEnabled() - Determine whether follower probing prior to becoming a candidate is enabled. |
| void | setCommitTimeout(int timeout) - Configure the default transaction commit timeout. |
| void | setFollowerProbingEnabled(boolean followerProbingEnabled) - Configure whether followers should be required to probe for network connectivity with a majority of the cluster after an election timeout prior to becoming a candidate. |
| void | setHeartbeatTimeout(int timeout) - Configure the heartbeat timeout. |
| void | setIdentity(String identity) - Configure the Raft identity. |
| void | setKVStore(AtomicKVStore kvstore) - Configure the AtomicKVStore in which local persistent state is stored. |
| void | setLogDirectory(File directory) - Configure the directory in which uncommitted log entries are stored. |
| void | setMaxElectionTimeout(int timeout) - Configure the maximum election timeout. |
| void | setMaxTransactionDuration(int duration) - Configure the maximum supported duration for outstanding transactions. |
| void | setMaxUnappliedLogMemory(long memory) - Configure the maximum allowed memory used for unapplied log entries. |
| void | setMinElectionTimeout(int timeout) - Configure the minimum election timeout. |
| void | setNetwork(Network network) - Configure the Network to use for inter-node communication. |
| void | start() |
| void | stop() |
| String | toString() |
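public static final int DEFAULT_MIN_ELECTION_TIMEOUT
Default minimum election timeout (750ms).
public static final int DEFAULT_MAX_ELECTION_TIMEOUT
Default maximum election timeout (1000ms).
public static final int DEFAULT_HEARTBEAT_TIMEOUT
Default heartbeat timeout (200ms).
public static final int DEFAULT_MAX_TRANSACTION_DURATION
Default maximum supported outstanding transaction duration (5000ms).
public static final long DEFAULT_MAX_UNAPPLIED_LOG_MEMORY
Default maximum supported applied log entry memory usage (104857600L bytes).
public static final int DEFAULT_COMMIT_TIMEOUT
Default transaction commit timeout (5000ms).
public static final int DEFAULT_TCP_PORT
Default TCP port (9660) used to communicate with peers.
public static final String OPTION_CONSISTENCY
Option key for createTransaction(Map). Value should be a Consistency instance, or the name() thereof.
public void setKVStore(AtomicKVStore kvstore)
Configure the AtomicKVStore in which local persistent state is stored.
Required property.
kvstore - local persistent data store
IllegalStateException - if this instance is already started
public void setLogDirectory(File directory)
Configure the directory in which uncommitted log entries are stored.
Required property.
directory - log directory
IllegalStateException - if this instance is already started
public File getLogDirectory()
Get the directory in which uncommitted log entries are stored.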
public void setNetwork(Network network)
Configure the Network to use for inter-node communication.
By default, a TCPNetwork instance communicating on DEFAULT_TCP_PORT is used.
network - network implementation; must not be started
IllegalStateException - if this instance is already started
public void setIdentity(String identity)
Configure the Raft identity.
Required property.
identity - unique Raft identity of this node in its cluster
IllegalStateException - if this instance is already started
public String getIdentity()
Get this node's Raft identity.
public void setMinElectionTimeout(int timeout)
Configure the minimum election timeout.
This must be set to a value greater than the heartbeat timeout.
Default is DEFAULT_MIN_ELECTION_TIMEOUT.
Warning: currently all nodes must have the same configured minimum election timeout; otherwise read-only transactions are not guaranteed to be completely up-to-date.
timeout - minimum election timeout in milliseconds
IllegalStateException - if this instance is already started
IllegalArgumentException - if timeout <= 0
public int getMinElectionTimeout()
Get the configured minimum election timeout.
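The warning about mismatched minimum election timeouts follows from the leader lease calculation described in the implementation details: the lease is derived from AppendRequest send times acknowledged by a majority, plus the minimum election timeout, minus a clock drift adjustment. The sketch below is one plausible reading of that description, not the actual implementation. All names are hypothetical, and it assumes the leader counts itself toward the majority.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of the leader lease calculation; all names are hypothetical. */
class LeaderLeaseSketch {

    /**
     * Estimate the leader lease timeout.
     *
     * @param ackedSendTimes per follower, the send time (ms) of the latest
     *        AppendRequest that the follower has responded to
     * @param clusterSize number of cluster members, including the leader
     * @param minElectionTimeout minimum election timeout (ms), assumed equal on all nodes
     * @param driftAdjustment small allowance (ms) for clock drift
     */
    static long leaseTimeout(List<Long> ackedSendTimes, int clusterSize,
      long minElectionTimeout, long driftAdjustment) {
        // Assumption: the leader counts itself, so it needs (majority - 1) followers
        int followersNeeded = clusterSize / 2;
        if (followersNeeded < 1 || ackedSendTimes.size() < followersNeeded)
            throw new IllegalArgumentException("no majority has responded");
        List<Long> sorted = new ArrayList<>(ackedSendTimes);
        Collections.sort(sorted, Collections.reverseOrder());
        // Latest send time acknowledged by enough followers to form a majority
        long base = sorted.get(followersNeeded - 1);
        return base + minElectionTimeout - driftAdjustment;
    }

    /** A read-only transaction may rely on the lease only while it lies in the future. */
    static boolean leaseCovers(long now, long leaseTimeout) {
        return now < leaseTimeout;
    }
}
```

For a five-node cluster with acknowledged send times 1000, 990, 400 and 300, a 750ms minimum election timeout and a 10ms drift allowance, the lease extends to 990 + 750 - 10 = 1730. If a follower were configured with a shorter minimum election timeout than the leader assumes, it could start an election before that time, and the lease would no longer guarantee up-to-date reads.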
public void setMaxElectionTimeout(int timeout)
Configure the maximum election timeout.
Default is DEFAULT_MAX_ELECTION_TIMEOUT.
timeout - maximum election timeout in milliseconds
IllegalStateException - if this instance is already started
IllegalArgumentException - if timeout <= 0
public int getMaxElectionTimeout()
Get the configured maximum election timeout.
public void setHeartbeatTimeout(int timeout)
Configure the heartbeat timeout.
This must be set to a value less than the minimum election timeout.
Default is DEFAULT_HEARTBEAT_TIMEOUT.
timeout - heartbeat timeout in milliseconds
IllegalStateException - if this instance is already started
IllegalArgumentException - if timeout <= 0
public int getHeartbeatTimeout()
Get the configured heartbeat timeout.
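The constraints among the three timeouts can be checked together before starting a node. The following sketch is a hypothetical helper, not part of RaftKVDatabase; the default values are copied from the field summary, and the check that the minimum election timeout does not exceed the maximum is an assumption not stated in this documentation.

```java
/** Hypothetical pre-start sanity check for timeout settings (all values in ms). */
class TimeoutConfigSketch {

    // Values copied from the documented defaults
    static final int DEFAULT_MIN_ELECTION_TIMEOUT = 750;
    static final int DEFAULT_MAX_ELECTION_TIMEOUT = 1000;
    static final int DEFAULT_HEARTBEAT_TIMEOUT = 200;

    /** Throws IllegalArgumentException if the combination is inconsistent. */
    static void validate(int heartbeat, int minElection, int maxElection) {
        if (heartbeat <= 0 || minElection <= 0 || maxElection <= 0)
            throw new IllegalArgumentException("timeouts must be positive");
        if (heartbeat >= minElection)
            throw new IllegalArgumentException(
              "heartbeat timeout must be less than the minimum election timeout");
        // Assumption: min <= max is implied by the names but not documented
        if (minElection > maxElection)
            throw new IllegalArgumentException(
              "minimum election timeout must not exceed the maximum");
    }
}
```

The documented defaults (200, 750, 1000) pass this check, while a heartbeat timeout equal to the minimum election timeout is rejected.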
public void setMaxTransactionDuration(int duration)
Configure the maximum supported duration for outstanding transactions.
This value is the Tmax value from the overview. A larger value means more memory may be used.
This value may be changed while this instance is already running.
Default is DEFAULT_MAX_TRANSACTION_DURATION.
duration - maximum supported duration for outstanding transactions in milliseconds
IllegalArgumentException - if duration <= 0
See also: setMaxUnappliedLogMemory(long)
public int getMaxTransactionDuration()
Get the configured maximum supported duration for outstanding transactions.
public void setMaxUnappliedLogMemory(long memory)
Configure the maximum allowed memory used for unapplied log entries.
This value is the Mmax value from the overview.
A higher value means transactions may be larger and/or stay open longer without causing a RetryTransactionException.
This value is approximate, and only affects leaders; followers always apply committed log entries immediately.
This value may be changed while this instance is already running.
Default is DEFAULT_MAX_UNAPPLIED_LOG_MEMORY.
memory - maximum allowed memory usage for cached applied log entries
IllegalArgumentException - if memory <= 0
See also: setMaxTransactionDuration(int)
public long getMaxUnappliedLogMemory()
Get the configured maximum allowed memory used for unapplied log entries.
public void setCommitTimeout(int timeout)
Configure the default transaction commit timeout.
This value determines how long transactions will wait, once commit() is invoked, for the commit to succeed before failing with a RetryTransactionException.
This can be overridden on a per-transaction basis via RaftKVTransaction.setTimeout(long).
This value may be changed while this instance is already running.
Default is DEFAULT_COMMIT_TIMEOUT.
timeout - transaction commit timeout in milliseconds, or zero for unlimited
IllegalArgumentException - if timeout is negative
See also: RaftKVTransaction.setTimeout(long)
public int getCommitTimeout()
Get the configured default transaction commit timeout.
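The interaction between the database-wide commit timeout and a per-transaction override can be pictured as a deadline calculation. This is a sketch with hypothetical names; it assumes that a per-transaction value, when present, simply replaces the database default and that zero means unlimited in both places, which this documentation only confirms for the database-wide setting.

```java
/** Toy model of commit deadline resolution; all names are hypothetical. */
class CommitTimeoutSketch {

    /**
     * Compute the absolute commit deadline in milliseconds.
     *
     * @param commitStartMillis time at which commit() was invoked
     * @param databaseTimeout database-wide default timeout; zero for unlimited
     * @param transactionTimeout per-transaction override, or null if none was set
     * @return deadline, or Long.MAX_VALUE when the timeout is unlimited
     */
    static long commitDeadline(long commitStartMillis, int databaseTimeout,
      Long transactionTimeout) {
        // Assumption: an override, when present, replaces the database default
        long timeout = transactionTimeout != null ? transactionTimeout : databaseTimeout;
        if (timeout < 0)
            throw new IllegalArgumentException("timeout is negative");
        return timeout == 0 ? Long.MAX_VALUE : commitStartMillis + timeout;
    }
}
```

With the default of 5000ms and a commit started at time 1000, the deadline is 6000; a transaction-level override of 250ms moves it to 1250.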
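public void setFollowerProbingEnabled(boolean followerProbingEnabled)
Configure whether followers should be required to probe for network connectivity with a majority of the cluster after an election timeout prior to becoming a candidate.
This value may be changed at any time.
The default is enabled.
followerProbingEnabled - true to enable, false to disable
public boolean isFollowerProbingEnabled()
Determine whether follower probing prior to becoming a candidate is enabled.
public int getClusterId()
Retrieve the unique 32-bit ID for this node's cluster.
A value of zero indicates an unconfigured system. Usually the reverse is true, though an unconfigured system can have a non-zero cluster ID in the rare case where an error occurred persisting the initial log entry.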
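public Map<String,String> getCurrentConfig()
Retrieve the current cluster configuration as understood by this node.
Configuration changes are performed and committed in the context of a normal transaction; see RaftKVTransaction.configChange().
If this system is unconfigured, an empty map is returned (and vice-versa).
The returned map is a copy; changes have no effect on this instance.
public boolean isConfigured()
Determine whether this instance is configured.
A node is configured if and only if it has at least one log entry. The first log entry always includes a configuration change that adds the node that created it to the (previously empty) cluster.
public boolean isClusterMember()
Determine whether this node thinks that it is part of its cluster, as determined by its current configuration.
public boolean isClusterMember(String node)
Determine whether this node thinks that the specified node is part of the cluster, as determined by its current configuration.
node - node identity
public Role getCurrentRole()
Get this instance's current role: leader, follower, or candidate.
Role, or null if not running
public long getCurrentTerm()
Get this instance's current term.
public long getCurrentTermStartTime()
Get the time at which this instance's current term advanced to its current value.
public long getCommitIndex()
Get this instance's current commit index.
public long getLastAppliedTerm()
Get this instance's last applied log entry term.
public long getLastAppliedIndex()
Get this instance's last applied log entry index.
public List<LogEntry> getUnappliedLog()
Get the unapplied LogEntrys in this instance's Raft log.
The returned list is a copy; changes have no effect on this instance.
public long getUnappliedLogMemoryUsage()
Get the estimated total memory used by unapplied log entries.
public List<RaftKVTransaction> getOpenTransactions()
Get the set of open transactions associated with this database.
The returned set is a copy; changes have no effect on this instance.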
@PostConstruct public void start()
start in interface KVDatabase
@PreDestroy public void stop()
stop in interface KVDatabase
public RaftKVTransaction createTransaction()
Create a new transaction.
Equivalent to: createTransaction(Consistency.LINEARIZABLE).
createTransaction in interface KVDatabase
IllegalStateException - if this instance is not started or in the process of shutting down
public RaftKVTransaction createTransaction(Map<String,?> options)
createTransaction in interface KVDatabase
public RaftKVTransaction createTransaction(Consistency consistency)
Create a new transaction with the specified consistency.
Transactions that wish to use Consistency.EVENTUAL_COMMITTED must be created using this method, because the log entry on which the transaction is based is determined at creation time.
consistency - consistency level
IllegalArgumentException - if consistency is null
IllegalStateException - if this instance is not started or in the process of shutting down
Copyright © 2016. All rights reserved.