
How to do distributed locking

Martin Kleppmann


As part of the research for my book, I came across an algorithm called Redlock on the Redis website. The algorithm claims to implement fault-tolerant distributed locks (or rather, leases [1]) on top of Redis, and the page asks for feedback from people who are into distributed systems. The algorithm instinctively set off some alarm bells in the back of my mind, so I spent a bit of time thinking about it and writing up these notes.

Since there are already over 10 independent implementations of Redlock and we don't know who is already relying on this algorithm, I thought it would be worth sharing my notes publicly. I won't go into other aspects of Redis, some of which have already been critiqued elsewhere.

Before I go into the details of Redlock, let me say that I quite like Redis, and I have successfully used it in production in the past. I think it's a good fit in situations where you want to share some transient, approximate, fast-changing data between servers, and where it's not a big deal if you occasionally lose that data for whatever reason. For example, a good use case is maintaining request counters per IP address (for rate limiting purposes) and sets of distinct IP addresses per user ID (for abuse detection).

However, Redis has been gradually making inroads into areas of data management where there are stronger consistency and durability expectations, which worries me, because this is not what Redis is designed for. Arguably, distributed locking is one of those areas. Let's examine it in some more detail.

What are you using that lock for?

The purpose of a lock is to ensure that among several nodes that might try to do the same piece of work, only one actually does it (at least only one at a time). That work might be to write some data to a shared storage system, to perform some computation, to call some external API, or suchlike. At a high level, there are two reasons why you might want a lock in a distributed application: for efficiency or for correctness [2]. To distinguish these cases, you can ask what would happen if the lock failed:

  • Efficiency: Taking a lock saves you from unnecessarily doing the same work twice (e.g. some expensive computation). If the lock fails and two nodes end up doing the same piece of work, the result is a minor increase in cost (you end up paying 5 cents more to AWS than you otherwise would have) or a minor inconvenience (e.g. a user ends up getting the same email notification twice).
  • Correctness: Taking a lock prevents concurrent processes from stepping on each others' toes and messing up the state of your system. If the lock fails and two nodes concurrently work on the same piece of data, the result is a corrupted file, data loss, permanent inconsistency, the wrong dose of a drug administered to a patient, or some other serious problem.

Both are valid cases for wanting a lock, but you need to be very clear about which one of the two you are dealing with.

I will argue that if you are using locks merely for efficiency purposes, it is unnecessary to incur the cost and complexity of Redlock, running 5 Redis servers and checking for a majority to acquire your lock. You are better off just using a single Redis instance, perhaps with asynchronous replication to a secondary instance in case the primary crashes.

If you use a single Redis instance, of course you will drop some locks if the power suddenly goes out on your Redis node, or something else goes wrong. But if you're only using the locks as an efficiency optimization, and the crashes don't happen too often, that's no big deal. This "no big deal" scenario is where Redis shines. At least if you're relying on a single Redis instance, it is clear to everyone who looks at the system that the locks are approximate, and only to be used for non-critical purposes.

On the other hand, the Redlock algorithm, with its 5 replicas and majority voting, looks at first glance as though it is suitable for situations in which your locking is important for correctness. I will argue in the following sections that it is not suitable for that purpose. For the rest of this article we will assume that your locks are important for correctness, and that it is a serious bug if two different nodes concurrently believe that they are holding the same lock.

Protecting a resource with a lock

Let's leave the particulars of Redlock aside for a moment, and discuss how a distributed lock is used in general (independent of the particular locking algorithm used). It's important to remember that a lock in a distributed system is not like a mutex in a multi-threaded application. It's a more complicated beast, because different nodes and the network can all fail independently in various ways.

For example, say you have an application in which a client needs to update a file in shared storage (e.g. HDFS or S3). A client first acquires the lock, then reads the file, makes some changes, writes the modified file back, and finally releases the lock. The lock prevents two clients from performing this read-modify-write cycle concurrently, which would result in lost updates. The code might look something like this:

// THIS CODE IS BROKEN
function writeData(filename, data) {
    var lock = lockService.acquireLock(filename);
    if (!lock) {
        throw 'Failed to acquire lock';
    }

    try {
        var file = storage.readFile(filename);
        var updated = updateContents(file, data);
        storage.writeFile(filename, updated);
    } finally {
        lock.release();
    }
}

Unfortunately, even if you have a perfect lock service, the code above is broken. The following diagram shows how you can end up with corrupted data:

[Figure: Unsafe access to a resource protected by a distributed lock]

In this example, the client that acquired the lock is paused for an extended period of time while holding the lock, for example because the garbage collector (GC) kicked in. The lock has a timeout (i.e. it is a lease), which is always a good idea (otherwise a crashed client could end up holding a lock forever and never releasing it). However, if the GC pause lasts longer than the lease expiry period, and the client doesn't realise that it has expired, it may go ahead and make some unsafe change.

This bug is not theoretical: HBase used to have this problem [3,4]. Normally, GC pauses are quite short, but "stop-the-world" GC pauses have sometimes been known to last for several minutes [5], certainly long enough for a lease to expire. Even so-called "concurrent" garbage collectors like the HotSpot JVM's CMS cannot fully run in parallel with the application code; even they need to stop the world from time to time [6].

You cannot fix this problem by inserting a check on the lock expiry just before writing back to storage. Remember that GC can pause a running thread at any point, including the point that is maximally inconvenient for you (between the last check and the write operation).
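To make that concrete, here is a sketch of the tempting "fix" applied to the example above; it is still broken. The isExpired() method is hypothetical (it is not part of any particular lock service), and lockService, storage and updateContents are the same made-up interfaces as in the earlier example.

// STILL BROKEN: checking the lease just before the write does not help
function writeData(filename, data) {
    var lock = lockService.acquireLock(filename);
    if (!lock) {
        throw 'Failed to acquire lock';
    }

    try {
        var file = storage.readFile(filename);
        var updated = updateContents(file, data);

        if (lock.isExpired()) {   // hypothetical expiry check on the lease
            throw 'Lease expired, aborting write';
        }
        // A GC pause, page fault or SIGSTOP right here can outlast the lease,
        // so the write below may still happen after another client has taken
        // over the lock.
        storage.writeFile(filename, updated);
    } finally {
        lock.release();
    }
}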

And if you're feeling smug because your programming language runtime doesn't have long GC pauses, there are many other reasons why your process might get paused. Maybe your process tried to read an address that is not yet loaded into memory, so it gets a page fault and is paused until the page is loaded from disk. Maybe your disk is actually EBS, and so reading a variable unwittingly turned into a synchronous network request over Amazon's congested network. Maybe there are many other processes contending for CPU, and you hit a black node in your scheduler tree. Maybe someone accidentally sent SIGSTOP to the process. Whatever. Your processes will get paused.

If you still don't believe me about process pauses, then consider instead that the file-writing request may get delayed in the network before reaching the storage service. Packet networks such as Ethernet and IP may delay packets arbitrarily, and they do [7]: in a famous incident at GitHub, packets were delayed in the network for approximately 90 seconds [8]. This means that an application process may send a write request, and it may reach the storage server a minute later, when the lease has already expired.

Even in well-managed networks, this kind of thing can happen. You simply cannot make any assumptions about timing, which is why the code above is fundamentally unsafe, no matter what lock service you use.

Making the lock safe with fencing

The fix for this problem is actually pretty simple: you need to include a fencing token with every write request to the storage service. In this context, a fencing token is simply a number that increases (e.g. incremented by the lock service) every time a client acquires the lock. This is illustrated in the following diagram:

[Figure: Using fencing tokens to make resource access safe]

Client 1 acquires the lease and gets a token of 33, but then it goes into a long pause and the lease expires. Client 2 acquires the lease, gets a token of 34 (the number always increases), and then sends its write to the storage service, including the token of 34. Later, client 1 comes back to life and sends its write to the storage service, including its token value 33. However, the storage server remembers that it has already processed a write with a higher token number (34), and so it rejects the request with token 33.

Note that this requires the storage server to take an active role in checking tokens, and to reject any writes on which the token has gone backwards. But this is not particularly hard, once you know the trick. And provided that the lock service generates strictly monotonically increasing tokens, this makes the lock safe. For example, if you are using ZooKeeper as your lock service, you can use the zxid or the znode version number as the fencing token, and you're in good shape [3].
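As a minimal sketch of the storage side of this scheme (using an in-memory map in place of a real storage backend, and a hypothetical writeFile entry point), the only essential logic is that the server remembers the highest fencing token it has accepted for each file and rejects anything older:

// Sketch of a storage server that enforces fencing tokens.
// highestTokenSeen and fileContents stand in for real server-side state.
var highestTokenSeen = {};   // filename -> largest fencing token accepted so far
var fileContents = {};       // filename -> current data

function writeFile(filename, data, fencingToken) {
    var latest = highestTokenSeen[filename] || 0;
    if (fencingToken <= latest) {
        // E.g. a delayed write carrying token 33 arrives after token 34 has
        // already been processed: it is rejected rather than applied.
        throw 'Rejected write with stale fencing token ' + fencingToken;
    }
    highestTokenSeen[filename] = fencingToken;
    fileContents[filename] = data;
}

The comparison and the update of highestTokenSeen must happen atomically on the server (in a single-threaded event loop, or under the server's own local mutex); otherwise two writes could interleave between the check and the assignment.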

However, this leads us to the first big problem with Redlock: it does not have any facility for generating fencing tokens. The algorithm does not produce any number that is guaranteed to increase every time a client acquires a lock. This means that even if the algorithm were otherwise perfect, it would not be safe to use, because you cannot prevent the race condition between clients in the case where one client is paused or its packets are delayed.

And it's not obvious to me how one would change the Redlock algorithm to start generating fencing tokens. The unique random value it uses does not provide the required monotonicity. Simply keeping a counter on one Redis node would not be sufficient, because that node may fail. Keeping counters on several nodes would mean they would go out of sync. It's likely that you would need a consensus algorithm just to generate the fencing tokens. (If only incrementing a counter was simple.)

Using time to solve consensus

The fact that Redlock fails to generate fencing tokens should already be sufficient reason not to use it in situations where correctness depends on the lock. But there are some further problems that are worth discussing.

In the academic literature, the most practical system model for this kind of algorithm is the asynchronous model with unreliable failure detectors [9]. In plain English, this means that the algorithms make no assumptions about timing: processes may pause for arbitrary lengths of time, packets may be arbitrarily delayed in the network, and clocks may be arbitrarily wrong, and the algorithm is nevertheless expected to do the right thing. Given what we discussed above, these are very reasonable assumptions.

The only purpose for which algorithms may use clocks is to generate timeouts, to avoid waiting forever if a node is down. But timeouts do not have to be accurate: just because a request times out, that doesn't mean the other node is definitely down. It could just as well be that there is a large delay in the network, or that your local clock is wrong. When used as a failure detector, a timeout is just a guess that something is wrong. (If they could, distributed algorithms would do without clocks entirely, but then consensus becomes impossible [10]. Acquiring a lock is like a compare-and-set operation, which requires consensus [11].)

Note that Redis uses gettimeofday, not a monotonic clock, to determine the expiry of keys. The man page for gettimeofday explicitly says that the time it returns is subject to discontinuous jumps in system time: it might suddenly jump forwards by a few minutes, or even jump back in time (e.g. if the clock is stepped by NTP because it differs too much from an NTP server, or if the clock is manually adjusted by an administrator). Thus, if the system clock is doing weird things, it could easily happen that a key in Redis expires much faster or much slower than expected.
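The same distinction matters on the client side. As a small illustration (in Node.js, for a hypothetical client keeping track of how much of its lease remains), measuring elapsed time with the wall clock inherits those jumps, whereas a monotonic clock only ever moves forward. This does not make a lock safe by itself (see the fencing discussion above); it only shows the difference between the two kinds of clock:

// Assumed lease TTL, chosen purely for the sake of the example
var LEASE_TTL_MS = 10000;

// Wall-clock version: Date.now() follows the system clock, which NTP or an
// administrator can step forwards or backwards, so this check can be wrong.
var wallDeadline = Date.now() + LEASE_TTL_MS;
function leaseLooksValidWallClock() {
    return Date.now() < wallDeadline;
}

// Monotonic version: process.hrtime.bigint() (nanoseconds) never jumps, so
// elapsed time is measured correctly even if the system clock is stepped.
var start = process.hrtime.bigint();
function leaseLooksValidMonotonic() {
    var elapsedMs = Number((process.hrtime.bigint() - start) / 1000000n);
    return elapsedMs < LEASE_TTL_MS;
}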

For algorithms in the asynchronous model this is not a big problem: these algorithms generally ensure that their safety properties always hold, without making any timing assumptions [12]. Only liveness properties depend on timeouts or some other failure detector. In plain English, this means that even if the timings in the system are all over the place (processes pausing, networks delaying, clocks jumping forwards and backwards), the performance of an algorithm might go to hell, but the algorithm will never make an incorrect decision.

However, Redlock is not like this. Its safety depends on a lot of timing assumptions: it assumes that all Redis nodes hold keys for approximately the right length of time before expiring; that the network delay is small compared to the expiry duration; and that process pauses are much shorter than the expiry duration.

Breaking Redlock with bad timings

Let's look at some examples to demonstrate Redlock's reliance on timing assumptions. Say the system has five Redis nodes (A, B, C, D and E), and two clients (1 and 2). What happens if the clock on one of the Redis nodes jumps forward?

  1. Client 1 acquires the lock on nodes A, B, C. Due to a network issue, D and E cannot be reached.
  2. The clock on node C jumps forward, causing the lock to expire.
  3. Client 2 acquires the lock on nodes C, D, E. Due to a network issue, A and B cannot be reached.
  4. Clients 1 and 2 now both believe they hold the lock.

A similar issue could happen if C crashes before persisting the lock to disk, and immediately restarts. For this reason, the Redlock documentation recommends delaying restarts of crashed nodes for at least the time-to-live of the longest-lived lock. But this restart delay again relies on a reasonably accurate measurement of time, and would fail if the clock jumps.

Okay, so maybe you think that a clock jump is unrealistic, because you're very confident in having correctly configured NTP to only ever slew the clock. In that case, let's look at an example of how a process pause may cause the algorithm to fail:

  1. Client 1 requests the lock on nodes A, B, C, D, E.
  2. While the responses to client 1 are in flight, client 1 goes into a stop-the-world GC pause.
  3. The locks expire on all Redis nodes.
  4. Client 2 acquires the lock on nodes A, B, C, D, E.
  5. Client 1 finishes GC and receives the responses from the Redis nodes indicating that it successfully acquired the lock (they were held in client 1's kernel network buffers while the process was paused).
  6. Clients 1 and 2 now both believe they hold the lock.

Note that even though Redis is written in C, and thus doesn't have GC, that doesn't help us here: any system in which the clients may experience a GC pause has this problem. You can only make this safe by preventing client 1 from performing any operations under the lock after client 2 has acquired the lock, for example using the fencing approach above.

A long network delay can produce the same effect as the process pause. It perhaps depends on your TCP user timeout: if you make the timeout significantly shorter than the Redis TTL, perhaps the delayed network packets would be ignored, but we'd have to look in detail at the TCP implementation to be sure. Also, with the timeout we're back down to accuracy of time measurement again!

The synchrony assumptions of Redlock

These examples show that Redlock works correctly only if you assume a synchronous system model, that is, a system with the following properties:

  • bounded network delay (you can guarantee that packets always arrive within some guaranteed maximum delay),
  • bounded process pauses (in other words, hard real-time constraints, which you typically only find in car airbag systems and suchlike), and
  • bounded clock error (cross your fingers that you don't get your time from a bad NTP server).

Note that a synchronous model does not mean exactly synchronised clocks: it means you are assuming a known, fixed upper bound on network delay, pauses and clock drift [12]. Redlock assumes that delays, pauses and drift are all small relative to the time-to-live of a lock; if the timing issues become as large as the time-to-live, the algorithm fails.

In a reasonably well-behaved datacenter environment, the timing assumptions will be satisfied most of the time; this is known as a partially synchronous system [12]. But is that good enough? As soon as those timing assumptions are broken, Redlock may violate its safety properties, e.g. granting a lease to one client before another client's lease has expired. If you're depending on your lock for correctness, "most of the time" is not enough; you need it to always be correct.

There is plenty of evidence that it is not safe to assume a synchronous system model for most practical system environments [7,8]. Keep reminding yourself of the GitHub incident with the 90-second packet delay. It is unlikely that Redlock would survive a Jepsen test.

On the other hand, a consensus algorithm designed for a partially synchronous system model (or an asynchronous model with failure detectors) actually has a chance of working. Raft, Viewstamped Replication, Zab and Paxos all fall into this category. Such an algorithm must let go of all timing assumptions. That's hard: it's so tempting to assume that networks, processes and clocks are more reliable than they really are. But in the messy reality of distributed systems, you have to be very careful with your assumptions.

Conclusion

I think the Redlock algorithm is a poor choice because it is "neither fish nor fowl": it is unnecessarily heavyweight and expensive for efficiency-optimization locks, yet it is not sufficiently safe for situations in which correctness depends on the lock.

In particular, the algorithm makes dangerous assumptions about timing and system clocks (essentially assuming a synchronous system with bounded network delay and bounded execution time for operations), and it violates safety properties if those assumptions are not met. Moreover, it lacks a facility for generating fencing tokens (which protect a system against long delays in the network or in paused processes).

If you need locks only on a best-effort basis (as an efficiency optimization, not for correctness), I would recommend sticking with the straightforward single-node locking algorithm for Redis (conditional set-if-not-exists to obtain a lock, atomic delete-if-value-matches to release a lock), and documenting very clearly in your code that the locks are only approximate and may occasionally fail. Don't bother with setting up a cluster of five Redis nodes.
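For reference, a minimal sketch of that single-node approach might look like the following. It assumes the ioredis client for Node.js (any Redis client exposing SET with NX/PX and EVAL works the same way), and the key prefix, TTL parameter and token format are illustrative choices rather than anything prescribed by this article:

// Sketch of the single-node Redis lock: SET key value NX PX to acquire,
// and an atomic delete-if-value-matches Lua script to release.
const Redis = require('ioredis');
const crypto = require('crypto');
const redis = new Redis();   // assumes a Redis instance on localhost:6379

const RELEASE_SCRIPT = `
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
else
    return 0
end`;

async function acquireLock(resource, ttlMs) {
    // Random value identifying this holder, so we never release someone else's lock.
    const token = crypto.randomBytes(16).toString('hex');
    const result = await redis.set('lock:' + resource, token, 'PX', ttlMs, 'NX');
    return result === 'OK' ? token : null;
}

async function releaseLock(resource, token) {
    // GET and DEL must happen atomically, hence the Lua script: a plain DEL
    // could remove a lock that has already expired and been re-acquired by
    // another client.
    return redis.eval(RELEASE_SCRIPT, 1, 'lock:' + resource, token);
}

Note that this is exactly the "approximate" lock discussed above: if the key expires while the holder is paused, two clients can hold it at once, which is acceptable only for efficiency-style locking.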

On the other hand, if you need locks for correctness, please don't use Redlock. Instead, please use a proper consensus system such as ZooKeeper, probably via one of the Curator recipes that implements a lock. (At the very least, use a database with reasonable transactional guarantees.) And please enforce the use of fencing tokens on all resource accesses under the lock.

As I said at the beginning, Redis is an excellent tool if you use it correctly. None of the above diminishes the usefulness of Redis for its intended purposes. Salvatore has been very dedicated to the project for years, and its success is well deserved. But every tool has limitations, and it is important to know them and to plan accordingly.

If you want to learn more, I explain this topic in greater detail in chapters 8 and 9 of my book, now available in Early Release from O'Reilly. (The diagrams above are taken from my book.) For learning how to use ZooKeeper, I recommend Junqueira and Reed's book [3]. For a good introduction to the theory of distributed systems, I recommend Cachin, Guerraoui and Rodrigues' textbook [13].

Thank you to Kyle Kingsbury, Camille Fournier, Flavio Junqueira and Salvatore Sanfilippo for reviewing a draft of this article. Any errors are mine, of course.

Update 9 Feb 2016: Salvatore, the original author of Redlock, has posted a rebuttal to this article (see also the HN discussion). He makes some good points, but I stand by my conclusions. If I have time I may elaborate in a follow-up post, but please form your own opinions, and please consult the references below, many of which have received rigorous academic peer review (unlike either of our blog posts).

References

[1] Cary G Gray and David R Cheriton: "Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency," at 12th ACM Symposium on Operating Systems Principles (SOSP), December 1989. doi:10.1145/74850.74870

[2] Mike Burrows: "The Chubby lock service for loosely-coupled distributed systems," at 7th USENIX Symposium on Operating System Design and Implementation (OSDI), November 2006.

[3] Flavio P Junqueira and Benjamin Reed: ZooKeeper: Distributed Process Coordination. O'Reilly Media, November 2013. ISBN: 978-1-4493-6130-3

[4] Enis Söztutar: "HBase and HDFS: Understanding filesystem usage in HBase," at HBaseCon, June 2013.

[5] Todd Lipcon: "Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 1," blog.cloudera.com, 24 February 2011.

[6] Martin Thompson: "Java Garbage Collection Distilled," mechanical-sympathy.blogspot.co.uk, 16 July 2013.

[7] Peter Bailis and Kyle Kingsbury: "The Network is Reliable," ACM Queue, volume 12, number 7, July 2014. doi:10.1145/2639988.2639988

[8] Mark Imbriaco: "Downtime last Saturday," github.com, 26 December 2012.

[9] Tushar Deepak Chandra and Sam Toueg: "Unreliable Failure Detectors for Reliable Distributed Systems," Journal of the ACM, volume 43, number 2, pages 225–267, March 1996. doi:10.1145/226643.226647

[10] Michael J Fischer, Nancy Lynch, and Michael S Paterson: "Impossibility of Distributed Consensus with One Faulty Process," Journal of the ACM, volume 32, number 2, pages 374–382, April 1985. doi:10.1145/3149.214121

[11] Maurice P Herlihy: "Wait-Free Synchronization," ACM Transactions on Programming Languages and Systems, volume 13, number 1, pages 124–149, January 1991. doi:10.1145/114005.102808

[12] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: "Consensus in the Presence of Partial Synchrony," Journal of the ACM, volume 35, number 2, pages 288–323, April 1988. doi:10.1145/42282.42283

[13] Christian Cachin, Rachid Guerraoui, and Luís Rodrigues: Introduction to Reliable and Secure Distributed Programming, Second Edition. Springer, February 2011. ISBN: 978-3-642-15259-7, doi:10.1007/978-3-642-15260-3

 

 

Original article: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html
