zl程序教程

您现在的位置是:首页 >  其他

当前栏目

go-redis版本不一致导致CPU使用率问题

2023-02-18 16:34:44 时间

大家好,iasher,fighting。

最近遇到了奇怪的问题(又),同样的项目,部署到不同的环境后,CPU使用率差了近7倍,本着努力学习,提升自我,实现自我价值的原则。必须得搞清楚为啥啊。

定位

直接祭出go大杀器pprof:https://pkg.go.dev/net/http/pprof

查看cpu使用情况:


Type: cpu
Time: Jun 22, 2022 at 5:01pm (CST)
Duration: 30s, Total samples = 3.30s (11.00%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 2180ms, 66.06% of 3300ms total
Dropped 159 nodes (cum <= 16.50ms)
Showing top 10 nodes out of 242
      flat  flat%   sum%        cum   cum%
     700ms 21.21% 21.21%      700ms 21.21%  runtime.futex
     600ms 18.18% 39.39%      800ms 24.24%  syscall.Syscall
     260ms  7.88% 47.27%      260ms  7.88%  runtime.nanotime (inline)
     140ms  4.24% 51.52%      140ms  4.24%  runtime.epollwait
     120ms  3.64% 55.15%      120ms  3.64%  time.Now
     110ms  3.33% 58.48%      110ms  3.33%  syscall.RawSyscall
      80ms  2.42% 60.91%      230ms  6.97%  runtime.scanobject
      70ms  2.12% 63.03%      170ms  5.15%  syscall.Syscall6
      50ms  1.52% 64.55%       60ms  1.82%  runtime.heapBitsSetType
      50ms  1.52% 66.06%       50ms  1.52%  runtime.markBits.isMarked (inline)
(pprof) list runtime.markBits.isMarked

为什么系统调用这么频繁呢?分析profile文件(本地分析):

go tool pprof -http 127.0.0.1:3001 /Users/admin/Downloads/pprof.samples.cpu.004.pb.gz

flame graph:

基本上syscall都是操作redis集群发起的,这是为啥咧?

在纳闷的时候,看到了曹大的一篇文章:go-redis 和 redis server 版本错位导致的高延时问题一例(https://xargin.com/go-redis-v6-and-redis-server-6-are-not-compatible/)

因此猜测此问题是不是也是因为go-redis和redis集群版本不一致导致的呢?然后确定了下各个环境的redis集群的版本,发现redis集群有redis5和redis6版本。CPU使用率较高的问题是redis6集群的机器发生的。

那么很可能cpu使用率差异较高的问题是因为go-redis和redis集群版本不对称造成的。

测试

1.安装redis5.*和redis6.* server

下载地址: https://redis.io/download/#redis-downloads

此时有两个redis服务目录:

redis-5.0.14
redis-6.2.7

解压后分别进入目录,执行 make命令

启动集群,分别进入对应目录执行以下命令:

cd ./redis-6.2.7/utils/create-cluster
./create-cluster start
./create-cluster create

查看集群信息:

localhost:create-cluster admin$ redis-cli -c -p 30001

127.0.0.1:30001> cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:6
cluster_size:3
cluster_current_epoch:6
cluster_my_epoch:1
cluster_stats_messages_ping_sent:12454
cluster_stats_messages_pong_sent:12434
cluster_stats_messages_sent:24888
cluster_stats_messages_ping_received:12429
cluster_stats_messages_pong_received:12454
cluster_stats_messages_meet_received:5
cluster_stats_messages_received:24888

默认创建了3主3从的集群

2.准备测试文件

核心代码和曹大文章中的没什么区别

redis-v6版本:

package main

import (
 "fmt"
 "log"
 "net/http"

 _ "net/http/pprof"

 "github.com/go-redis/redis"
)

var cli = redis.NewClusterClient(&redis.ClusterOptions{
 Addrs: []string{"127.0.0.1:30001", "127.0.0.1:30002", "127.0.0.1:30003"},
})

func hashSet(w http.ResponseWriter, r *http.Request) {
 ret := cli.HSet("key1", "12345", "key1val")
 if ret.Err() != nil {
  fmt.Println(ret.Err())
 }
}

func main() {
 http.HandleFunc("/", hashSet)
 err := http.ListenAndServe(":3003", nil)
 if err != nil {
  log.Fatal("listen err:", err)
 }
}

redis-v8版本:

package main

import (
 "context"
 "fmt"
 "log"
 "net/http"

 _ "net/http/pprof"

 "github.com/go-redis/redis/v8"
)

var cli = redis.NewClusterClient(&redis.ClusterOptions{
 Addrs: []string{"127.0.0.1:30001", "127.0.0.1:30002", "127.0.0.1:30003"},
})

func hashSet(w http.ResponseWriter, r *http.Request) {
 ret := cli.HSet(context.TODO(), "key1", "12345", "key1val")
 if ret.Err() != nil {
  fmt.Println(ret.Err())
 }
}

func main() {
 http.HandleFunc("/", hashSet)
 err := http.ListenAndServe(":3003", nil)
 if err != nil {
  log.Fatal("listen err:", err)
 }
}

3.进行测试

使用wrk测试,wrk安装如下:

地址:https://github.com/wg/wrk
下载后执行make即可在当前目录生成wrk可执行文件

然后分别进行测试,最后测试结果如下:

wrk -t10 -c100 -d60 http://localhost:3003

go-redis Client

Server

平均延迟

qps

v6

5.0.14

10ms

13973.76

v8

5.0.14

10ms

12142.28

v6

6.2.7

1.34s

73.25

v8

6.2.7

9.92ms

13042.41

从测试数据看,在sdk和redis集群版本不对称的go-redis v6 -> redis cluster 6.2.7中延迟明显增高。

问题分析

分析下 go-redis v6源码:

func hashSet(w http.ResponseWriter, r *http.Request) {
  ret := cli.HSet("key1", "12345", "key1val")
  if ret.Err() != nil {
    fmt.Println(ret.Err())
  }
}

执行HSet命令的是 c.process, c.porcess是在连接集群的时候赋值的:

var cli = redis.NewClusterClient(&redis.ClusterOptions{
  Addrs: []string{"127.0.0.1:30001", "127.0.0.1:30002", "127.0.0.1:30003"},
})

func NewClusterClient(opt *ClusterOptions) *ClusterClient {
  ...
  c.cmdsInfoCache = newCmdsInfoCache(c.cmdsInfo)

  c.process = c.defaultProcess
  ...
  return c
}

defaultProcess:获取集群节点以及槽信息

func (c *ClusterClient) defaultProcess(cmd Cmder) error {
 var node *clusterNode
  ...
  if node == nil {
   var err error
   # 获取集群节点以及槽信息
   _, node, err = c.cmdSlotAndNode(cmd)
   if err != nil {
    cmd.setErr(err)
    break
   }
  }

     ...
 return cmd.Err()
}

cmdSlotAndNode:

func (c *ClusterClient) cmdSlotAndNode(cmd Cmder) (int, *clusterNode, error) {
  state, err := c.state.Get()
  if err != nil {
    return 0, nil, err
  }
    # 获取cmdInfo
  cmdInfo := c.cmdInfo(cmd.Name())
  slot := c.cmdSlot(cmd)

  ...
  return slot, node, err
}

cmdInfo:最终从cache中获取cmdInfo

func (c *ClusterClient) cmdInfo(name string) *CommandInfo {
    # 最终从cache中获取cmdInfo
  cmdsInfo, err := c.cmdsInfoCache.Get()
  if err != nil {
    return nil
  }
    ...
}

cmdInfo也是在初始化连接的时候赋值的:

func NewClusterClient(opt *ClusterOptions) *ClusterClient {
  opt.init()

  c := &ClusterClient{
    opt:   opt,
    nodes: newClusterNodes(opt),
  }
  
  # 初始化cmdInfoCache
  c.cmdsInfoCache = newCmdsInfoCache(c.cmdsInfo)
    ...
  return c
}

在执行cmd相对的命令的时候,调用了初始化时指定的方法: c.cmdsInfo

func (c *ClusterClient) cmdsInfo() (map[string]*CommandInfo, error) {
  addrs, err := c.nodes.Addrs()
  if err != nil {
    return nil, err
  }

  var firstErr error
  for _, addr := range addrs {
    node, err := c.nodes.Get(addr)
    if err != nil {
      return nil, err
    }
    if node == nil {
      continue
    }

    info, err := node.Client.Command().Result()
    if err == nil {
      return info, nil
    }
  
  }
  return nil, firstErr
}

也就是说,最终是 cmdsInfoCache.Get()方法执行 cmdsInfo方法, cmdsInfo方法向任意节点发起command,有节点回应则返回节点信息

func (c *cmdsInfoCache) Get() (map[string]*CommandInfo, error) {
  err := c.once.Do(func() error {
    cmds, err := c.fn()
    if err != nil {
      return err
    }
    c.cmds = cmds
    return nil
  })
  return c.cmds, err
}

获取到节点信息后,保存到 cmdsInfoCache中, once.Do是要加锁的,用曹大的话说:函数返回的 err 是空的话,cmdsInfoCache 就初始化完成了,根据 sync.Once 的性质,下次所有命令的 once.Do 就只是简单执行一个 atomic.Load,成本不高。

但是根据实际情况看,此处的执行成本依然是最高的,即err是不为空的。 once.Do一直在加锁、解锁中,即在获取连接的时候是返回错误的。

所以此处看看实际发生了什么错误呢?

cmdsInfo,err map[] redis: got 7 elements in COMMAND reply, wanted 6
cmdsInfo,err map[] redis: got 7 elements in COMMAND reply, wanted 6
cmdsInfo,err map[] redis: got 7 elements in COMMAND reply, wanted 6
cmdsInfo,err map[] redis: got 7 elements in COMMAND reply, wanted 6

也就说,go-redis v6解析redis cluster6.*版本的返回信息,会发生错误。

引用曹大的话:

每次 parse 都出错,那自然每次 once.Do 都会进 slow path 了,redis cluster 的 client 是全局公用,所以这里的锁是个全局锁,并且锁内有较慢的网络调用。

具体原因:Fix performance degradation due to non-caching of command info in Redis 6 Cluster #1355(https://github.com/go-redis/redis/pull/1355)

那为什么返回错误了,最后存取数据却可以呢?

cmdSlotAndNode方法可以看到,如果cmdsInfoCache没有cmdInfo缓存,则直接根据slot找到对应的node,然后继续后续操作。

解决问题

定位到问题产生的原因后,解决方案是升级go-redis即可

最后

可以看看go-redis v8版本如何解决这个问题的:

github: https://github.com/go-redis/redis/compare/support-commandsinfo-acl-flags...yuuki:support-commandsinfo-acl-flags

v6版本解析redis返回命令:

func commandInfoParser(rd *proto.Reader, n int64) (interface{}, error) {
  var cmd CommandInfo
  var err error

  if n != 6 {
    return nil, fmt.Errorf("redis: got %d elements in COMMAND reply, wanted 6", n)
  }

  cmd.Name, err = rd.ReadString()
  if err != nil {
    return nil, err
  }

    ...
}

v8版本:

func commandInfoParser(rd *proto.Reader, n int64) (interface{}, error) {
  const numArgRedis5 = 6
  const numArgRedis6 = 7

  switch n {
  case numArgRedis5, numArgRedis6:
    // continue
  default:
    return nil, fmt.Errorf("redis: got %d elements in COMMAND reply, wanted 7", n)
  }
    ...
  _, err = rd.ReadReply(func(rd *proto.Reader, n int64) (interface{}, error) {
    cmd.Flags = make([]string, n)
    for i := 0; i < len(cmd.Flags); i++ {
      switch s, err := rd.ReadString(); {
      case err == Nil:
        cmd.Flags[i] = ""
      case err != nil:
        return nil, err
      default:
        cmd.Flags[i] = s
      }
    }
    return nil, nil
  })
  ...
  if n == numArgRedis5 {
    return &cmd, nil
  }

  _, err = rd.ReadReply(func(rd *proto.Reader, n int64) (interface{}, error) {
    cmd.ACLFlags = make([]string, n)
    for i := 0; i < len(cmd.ACLFlags); i++ {
      switch s, err := rd.ReadString(); {
      case err == Nil:
        cmd.ACLFlags[i] = ""
      case err != nil:
        return nil, err
      default:
        cmd.ACLFlags[i] = s
      }
    }
    return nil, nil
  })
  if err != nil {
    return nil, err
  }

  return &cmd, nil
}