Faults & Partial Failures
On a single machine, if an internal fault occurs, we expect the system to crash rather than return wrong values.
But in a distributed system, some parts can be broken while others work as expected. This is called a partial failure. Partial failures are non-deterministic. We may not even know whether an operation succeeded or failed, because the time a message takes to travel across the network is also non-deterministic. So we don’t know which nodes have failed, or even whether a particular node has failed at all. That is why working with distributed systems is difficult.
Unreliable Networks
Distributed systems are shared-nothing systems: nodes communicate only via asynchronous message passing over the network, and the network is unreliable.
If a request is sent, many things could go wrong:
Request may have been lost (unplugged network cable)
Request may be waiting in a queue and will be delivered later (e.g., the network or the remote node is overloaded)
Remote node has crashed and is unavailable.
Remote node has paused temporarily (for garbage collection)
Remote node has processed the request, but response got lost in the network.
Response is delayed because our network is overloaded.
In all these scenarios, it is difficult to tell why we didn’t get a response. We usually handle these with a timeout.
When one part of a network is cut off from another due to a network fault, it is called a network partition.
A timeout is the only practical way to detect a fault, but there are trade-offs:
If the timeout is too long, the system is unavailable for that whole period and users might see error messages.
If the timeout is too short, a node that is actually alive but busy with some action may be declared dead prematurely; if another node takes over, the action can end up happening twice. Also, when a node is declared dead, its responsibilities are transferred to other nodes. If the system is already under heavy load, removing a node shifts load onto the rest, which can cause further nodes to be declared dead and lead to a cascading failure.
Most of the variability in network delay is due to queueing. Since all requests share the same links, there will be some queueing when traffic is high. Queues can build up in several places:
If several nodes send data to the same destination, the network switch must queue them up and feed them into the destination network link one by one. On a busy network link, the data has to wait in the queue till it can get a slot. If the switch queue fills up, some packets may be dropped.
When data reaches the destination machine, if all CPU cores are currently busy, the OS queues the incoming request until the application is ready to handle it. Here too the wait time is unbounded.
In virtualized environments, the running OS is often paused for tens of milliseconds while another virtual machine uses the CPU core. During that time the VM cannot consume any data from the network, so incoming data is queued (buffered) by the virtual machine monitor, further increasing the variability of network delays.
TCP performs flow control (also known as congestion avoidance or backpressure), in which a node limits its own rate of sending to avoid overloading a network link or the receiving node.
If some packets are lost, TCP retransmits them, which adds to the delay.
In cloud environments, machines are shared, so network delays vary depending on the workload of your neighbors.
So the best way to choose timeouts is experimentally, based on observed behavior. A system could also continuously measure response times and their variability, and adjust its timeouts accordingly, as sketched below.
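A minimal sketch of that adaptive idea, loosely modelled on how TCP derives its retransmission timeout from a smoothed round-trip time and its variance; the class name and the smoothing constants are illustrative assumptions, not taken from any particular library.

```java
// Keeps an exponentially weighted moving average of observed response
// times and their variability, and derives the timeout from both
// (similar in spirit to TCP's retransmission timeout calculation).
public class AdaptiveTimeout {
    private double smoothedRtt = 100.0;   // initial guess, in milliseconds
    private double rttVariance = 50.0;    // initial variability estimate

    // Record the measured round-trip time of every successful request.
    public synchronized void recordResponseTime(double rttMillis) {
        double deviation = Math.abs(rttMillis - smoothedRtt);
        rttVariance = 0.75 * rttVariance + 0.25 * deviation;
        smoothedRtt = 0.875 * smoothedRtt + 0.125 * rttMillis;
    }

    // Timeout = expected delay plus a safety margin for jitter.
    public synchronized long currentTimeoutMillis() {
        return (long) (smoothedRtt + 4 * rttVariance);
    }
}
```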
Unreliable Clocks
In distributed systems, time handling is tricky because of variable network delays in communication.
Each machine has its own hardware clock, which has minor manufacturing differences and can run slightly faster or slower than the clocks of other machines. Clocks can be synchronized using NTP (Network Time Protocol), which adjusts a machine’s time according to the time reported by a group of servers. Those servers in turn get their time from a more accurate source, such as a GPS receiver.
Computers have two types of clocks
Time-of-day clock — Returns the current wall-clock time. These clocks are adjusted using NTP, so they should ideally show the same time on all machines. But if a machine’s local clock has run too fast, NTP may step it backwards, so the time can appear to jump back. This makes time-of-day clocks unsuitable for measuring durations.
System.currentTimeMillis() in Java or clock_gettime(CLOCK_REALTIME) on Linux returns wall-clock time.
Monotonic clock — The value of this clock always moves forward, so it is suitable for measuring durations (see the sketch after this list). System.nanoTime() in Java or clock_gettime(CLOCK_MONOTONIC) on Linux returns monotonic time.
A machine with multiple CPU sockets may have a separate timer per CPU, but the operating system attempts to present a single, uniform monotonic time to applications.
When NTP detects that the local clock is drifting, it can speed up or slow down the rate at which the monotonic clock advances (slewing), but it never makes the monotonic clock jump backwards.
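A small sketch of why the distinction matters when measuring elapsed time, using the JVM calls mentioned above; the NTP step-back scenario is only described in the comments, since it cannot be triggered from application code.

```java
public class ClockDemo {
    public static void main(String[] args) throws InterruptedException {
        // Wall-clock time: subject to NTP adjustments, may jump backwards.
        long wallStart = System.currentTimeMillis();
        // Monotonic time: only ever moves forward, so safe for durations.
        long monoStart = System.nanoTime();

        Thread.sleep(1000);  // stand-in for the work being timed

        long wallElapsedMs = System.currentTimeMillis() - wallStart;
        long monoElapsedMs = (System.nanoTime() - monoStart) / 1_000_000;

        // If NTP stepped the wall clock backwards during the sleep,
        // wallElapsedMs could even be negative; monoElapsedMs cannot be.
        System.out.println("wall-clock elapsed: " + wallElapsedMs + " ms");
        System.out.println("monotonic elapsed:  " + monoElapsedMs + " ms");
    }
}
```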
NTP synchronization issues
The computer’s local quartz clock might drift (run faster or slower than it should).
If the local time differs too much from the NTP server’s time, the node might refuse to synchronize, or the clock might be forcibly reset, which makes time appear to go backwards.
If a node is accidentally firewalled off from NTP servers, its clock can silently drift with nobody noticing that it is out of sync.
If there is large network delay, synchronization accuracy suffers, since NTP can only correct for network delay up to a point.
NTP servers themselves might be misconfigured and report the wrong time.
Leap seconds (minutes with 59 or 61 seconds) can cause problems in systems that were not designed with leap seconds in mind.
In virtual machines, the hardware clock is virtualized. If a VM is paused for a while, the application sees the clock suddenly jump forward by that many seconds or minutes.
If the device is not under your control (for example a user’s phone or laptop), its clock cannot be trusted, since the user may have set the time incorrectly, accidentally or deliberately.
So if your application depends on synchronized clocks, you have to monitor the clock offsets between machines and take action (for example, remove a node from the cluster) if its clock drifts too far from the others.
If you use time-of-day timestamps to determine the ordering of writes (last write wins), you might silently lose writes due to clock skew: a causally later write from a node with a lagging clock gets an earlier timestamp and is discarded. Use version vectors to track causal ordering instead. The sketch below illustrates the problem.
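A hypothetical illustration of the last-write-wins problem; the timestamps and values are made up, and the point is only that node B’s lagging clock makes its causally later write lose.

```java
// Last-write-wins conflict resolution keyed on wall-clock timestamps.
class LastWriteWins {
    static class Write {
        final String value;
        final long timestampMillis;
        Write(String value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    // Keep whichever write carries the larger timestamp.
    static Write resolve(Write current, Write incoming) {
        return incoming.timestampMillis > current.timestampMillis ? incoming : current;
    }

    public static void main(String[] args) {
        // Node A writes first, but its clock is 2 ms ahead; node B writes
        // later in real time with a lagging clock. B's causally later
        // write is silently discarded.
        Write fromA = new Write("x = 1", 1_003);
        Write fromB = new Write("x = 2", 1_001);
        System.out.println(resolve(fromA, fromB).value);  // prints "x = 1"
    }
}
```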
Instead of a single point in time, a clock reading can be reported as a confidence interval, and a transaction can wait out that uncertainty before committing so that transactions are correctly ordered. Google Spanner’s TrueTime API uses such confidence intervals. In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction. By doing so, it ensures that any transaction that may read the data is at a sufficiently later time, so their confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms.
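A hedged sketch of that commit-wait idea. The UncertainClock interface is hypothetical and stands in for a TrueTime-like API that reports an [earliest, latest] interval rather than a single instant; it is not Spanner’s actual interface.

```java
// Commit wait: pick the commit timestamp at the upper end of the clock's
// uncertainty interval, then wait until even the lower end of "now" has
// passed it, so no later transaction can get an overlapping timestamp.
interface UncertainClock {
    long earliestMillis();  // lower bound on the true current time
    long latestMillis();    // upper bound on the true current time
}

class CommitWait {
    static long commit(UncertainClock clock) throws InterruptedException {
        long commitTimestamp = clock.latestMillis();
        while (clock.earliestMillis() <= commitTimestamp) {
            Thread.sleep(1);  // wait out the clock uncertainty
        }
        return commitTimestamp;  // safe to acknowledge the commit now
    }
}
```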
Process Pauses
Even if clocks are synchronized properly, it is difficult to rely on time because a process can be paused arbitrarily at any point:
A stop-the-world garbage collection (GC) pause
Unexpected I/O, e.g. a JVM lazily loading a classfile when it is first used
A virtual machine being suspended and resumed
A user closing the lid of their laptop and reopening it later
The operating system context-switching to a different thread
Swapping (paging) to disk
A Unix process being stopped with SIGSTOP and resumed later with SIGCONT
In all these cases, the paused code resumes running without knowing that time has elapsed; nothing notifies it of the pause. The watchdog sketch below is one way to at least observe such pauses.
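A common trick for spotting such pauses from inside the process is a watchdog thread that sleeps for a short, fixed interval and measures how long the sleep actually took on the monotonic clock; a gap much larger than requested means the whole process was stalled. The class name and thresholds below are arbitrary assumptions.

```java
// Watchdog thread: if a 100 ms sleep takes far longer than requested on
// the monotonic clock, the whole JVM was likely paused (GC, VM
// suspension, swapping, etc.), and we report the approximate pause.
public class PauseWatchdog implements Runnable {
    private static final long INTERVAL_MS = 100;
    private static final long REPORT_THRESHOLD_MS = 500;

    @Override
    public void run() {
        long last = System.nanoTime();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(INTERVAL_MS);
            } catch (InterruptedException e) {
                return;
            }
            long now = System.nanoTime();
            long elapsedMs = (now - last) / 1_000_000;
            if (elapsedMs > REPORT_THRESHOLD_MS) {
                System.err.println("Process appears to have paused for ~"
                        + (elapsedMs - INTERVAL_MS) + " ms");
            }
            last = now;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread watchdog = new Thread(new PauseWatchdog(), "pause-watchdog");
        watchdog.setDaemon(true);
        watchdog.start();
        Thread.sleep(60_000);  // stand-in for the real application work
    }
}
```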
Ideas to overcome process pauses
Treat a GC pause like a brief planned outage. If the runtime can warn the application that a node will soon need a garbage collection, the application can stop sending requests to that node and let other nodes take over the work while it collects its garbage.
Restart processes periodically (a rolling restart) and use the garbage collector only for short-lived objects, so that the long-lived (old generation) heap never needs a full collection and long stop-the-world pauses are avoided.
Knowledge, Truth & Lies
Truth is determined by a majority
A node cannot fully trust its own judgment: it could have been paused by a GC while all of its threads were oblivious to it, or there could be an asymmetric network fault in which the node can receive and send messages, but the other nodes never receive what it sends.
So such decisions are made by a quorum, i.e. a majority of more than half the nodes.
Consider leader locks: if the node holding a lock experiences a GC pause, its lock lease may have expired by the time it resumes, and another node may already have acquired the lock and become the leader. When the old node then tries to write, the database (or other storage system) should check whether its claim is still valid. This can be implemented with fencing tokens (monotonically increasing token numbers issued along with the lock), as sketched below.
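A minimal sketch of the check on the storage side, under the assumption that the lock service hands out an increasing token with every lock grant; the class and method names here are illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

// Storage-side fencing: accept a write only if its fencing token is at
// least as high as the highest token seen so far. A client that resumes
// after a long GC pause still holds its old (smaller) token, so its
// stale write is rejected.
public class FencedStore {
    private final AtomicLong highestTokenSeen = new AtomicLong(-1);

    public boolean write(long fencingToken, String key, String value) {
        long current = highestTokenSeen.get();
        while (fencingToken >= current) {
            if (highestTokenSeen.compareAndSet(current, fencingToken)) {
                doWrite(key, value);           // token is current: apply the write
                return true;
            }
            current = highestTokenSeen.get();  // lost a race with another writer; re-check
        }
        return false;                          // stale token: reject the write
    }

    private void doWrite(String key, String value) {
        // persist the value (omitted in this sketch)
    }
}
```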
If nodes may arbitrarily lie, for example by claiming they did not receive a message when in fact they did, this is called a Byzantine fault, and reaching consensus in this setting is known as the Byzantine Generals Problem.
Dealing with weak forms of lying
Corrupted network packets are usually caught by the checksums built into TCP and UDP; a corrupted packet is discarded and, in TCP, retransmitted.
Sanitize inputs from users
Configure each NTP client with multiple server addresses, so that a single server reporting a bad time is outvoted and can be tolerated.
System Models
To formalize the timing assumptions that an algorithm may rely on, we can consider the following system models:
Synchronous: This model assumes bounded network delay, bounded process pauses and bounded clock error. This is not a realistic model for distributed systems because of unbounded network delay.
Partially synchronous: In this model the system behaves synchronously most of the time, but occasionally network delays, process pauses, or clock errors become unbounded. This is a good model for real distributed systems, which work well most of the time with occasional misbehavior.
Asynchronous: Here no timing assumptions are made at all; the algorithm does not even have a clock, so it cannot use timeouts. Some algorithms can be designed for the asynchronous model, but it is very restrictive.
Node failure models
Crash-stop — A node can fail only by crashing, and once it has crashed it never comes back.
Crash-recovery — A node may crash and start responding again after some unknown time. Nodes have stable storage that is preserved across crashes, whereas in-memory state is lost.
Arbitrary (Byzantine) faults — Nodes may do absolutely anything, including trying to trick and deceive other nodes.
For modelling real distributed systems, the partially synchronous model with crash-recovery faults is generally the most useful.
Correctness Models
As an example, here are properties that define correctness for an algorithm that generates fencing tokens:
Uniqueness — No two requests for a fencing token return the same value.
Monotonic sequence — If request x returned token tx, request y returned token ty, and x completed before y began, then tx < ty.
Availability — A node that requests a fencing token and does not crash eventually receives a response.
Safety and liveness
In the above example, uniqueness and monotonic sequence are safety properties, while availability is a liveness property. Informally, safety means nothing bad happens, whereas liveness means something good eventually happens. Safety properties must always hold, i.e. they must never be violated. For liveness properties some leeway is allowed: for example, we may only require that the system is available if a majority of the nodes have not crashed.
Even if an algorithm satisfies all of its correctness properties within a model, it is not guaranteed to behave correctly in real life, because the model is only a simplification of reality and its assumptions do not always hold. For example, in the crash-recovery model we assumed that stable storage survives crashes; what if that storage becomes corrupted? We still need to handle such cases in practice. Nevertheless, the theoretical model is a good approximation to reason about.
Source : https://books.google.com/books/about/Designing_Data_Intensive_Applications.html?id=zFheDgAAQBAJ