您现在的位置是：首页 > Java

当前栏目

Java|存储|Guava Bloom Filter源码剖析

JAVA 存储源码

2023-03-15 22:02:22 时间

Bloom Filter(布隆过滤器)以牺牲少量正确率为代价，利用较少的空间实现O(1)的查询，在LSM Tree、Cache中作为常见的读优化手段。本文结合谷歌的Guava源码介绍Bloom Filter的实现。

算法

我们可以回忆一下bitmap是如何进行查找的，本质上，bitmap就是对于元素进行了

的哈希映射，利用一个bit判断元素是否位于集合中。

布隆过滤器实质上以增加时间开销为代价，节约空间。就是通过多个哈希映射，

，利用一组bit判断元素是否位于集合中。

因为允许哈希冲突，所以布隆过滤器会有一定的误判率。

Put

public <T> boolean put(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);
      boolean bitsChanged = false;
      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        bitsChanged |= bits.set(combinedHash % bitSize);
      }
      return bitsChanged;
    }

Less Hashing, Same Performance: Building a Better Bloom Filter

谷歌使用了哈佛这篇论文的结论，利用两个32位的hash函数即可映射到n个bit上。

首先将object通过funnel转换为基本类型，计算出64位hash，并且将高位低位分别作为hash。这样一来只调用了一次hash函数，大大节约了时间开销。

Hint: 联想一下算法导论散列表那一节，对于开放地址法，算法导论提出了线性探查、二次探查、双重散列等几种探查方法，以双重散列为冲突最小，这里的思路其实就是双重散列！联想一下上面提到的因为hash冲突而导致误判，那么减少冲突其实是自然而然的。

那么我们发现，本质上，这个布隆过滤器其实就是向散列表中插入了numHashFunctions个相同的对象，而散列表恰好采用了开放地址法+双重散列罢了。

mightContain

 @Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);

      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        if (!bits.get(combinedHash % bitSize)) {
          return false;
        }
      }
      return true;
    }

思路和put是一样的，如果任意一个映射的bit为0，则元素miss，只有全部为1才认为元素可能存在。但是因为存在hash冲突，所以并不是100%存在。

参数

参数能够自动根据需求计算

hash数计算

给定元素数目和布隆过滤器bit数目，计算布隆过滤器hash函数数目。

  // Cheat sheet:
  //
  // m: total bits
  // n: expected insertions
  // b: m/n, bits per insertion
  // p: expected false positive probability
  //
  // 1) Optimal k = b * ln2
  // 2) p = (1 - e ^ (-kn/m))^k
  // 3) For optimal k: p = 2 ^ (-k) ~= 0.6185^b
  // 4) For optimal k: m = -nlnp / ((ln2) ^ 2)
  /**
   * Computes the optimal k (number of hashes per element inserted in Bloom filter), given the
   * expected insertions and total number of bits in the Bloom filter.
   *
   * <p>See http://en.wikipedia.org/wiki/File:Bloom_filter_fp_probability.svg for the formula.
   *
   * @param n expected insertions (must be positive)
   * @param m total number of bits in Bloom filter (must be positive)
   */
  @VisibleForTesting
  static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

bit数计算

给定元素数目和误判率，计算布隆过滤器bit数目。

/**
   * Computes m (total bits of Bloom filter) which is expected to achieve, for the specified
   * expected insertions, the required false positive probability.
   *
   * <p>See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the
   * formula.
   *
   * @param n expected insertions (must be positive)
   * @param p false positive rate (must be 0 < p < 1)
   */
  @VisibleForTesting
  static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

Compatible

对于所有参数都相同的布隆过滤器，可以进行位运算。使得两个布隆过滤器的bit数组取并集，这样就能结合两个集合的信息了。

 public boolean isCompatible(BloomFilter<T> that) {
    checkNotNull(that);
    return this != that
        && this.numHashFunctions == that.numHashFunctions
        && this.bitSize() == that.bitSize()
        && this.strategy.equals(that.strategy)
        && this.funnel.equals(that.funnel);
  }
  public void putAll(BloomFilter<T> that) {
    checkNotNull(that);
    checkArgument(this != that, "Cannot combine a BloomFilter with itself.");
    checkArgument(
        this.numHashFunctions == that.numHashFunctions,
        "BloomFilters must have the same number of hash functions (%s != %s)",
        this.numHashFunctions,
        that.numHashFunctions);
    checkArgument(
        this.bitSize() == that.bitSize(),
        "BloomFilters must have the same size underlying bit arrays (%s != %s)",
        this.bitSize(),
        that.bitSize());
    checkArgument(
        this.strategy.equals(that.strategy),
        "BloomFilters must have equal strategies (%s != %s)",
        this.strategy,
        that.strategy);
    checkArgument(
        this.funnel.equals(that.funnel),
        "BloomFilters must have equal funnels (%s != %s)",
        this.funnel,
        that.funnel);
    this.bits.putAll(that.bits);
  }

剩下的大部分和序列化有关，无视之。

猜你喜欢

放弃了 7 年的 Java，投身互联网做 PHP，我是如何成为创业公司的 CTO？
浅析Python的类、继承和多态
如何使用React创建视频和动画
【干货】Python自动化审计及实现
前端浏览器缓存知识梳理
如何用Python执行常见的Excel和SQL任务
面试官：说说你对Node.js 的理解？优缺点？应用场景？
手把手教你用 Python 和 Scikit-Learn 实现垃圾邮件过滤
为什么源码中都使用16进制进行状态管理？
技术 | 使用Python来学习数据科学的完整教程
从零开始教你用Python做词云
一些你可能还不知事件技巧– Vue3更新
Python Web框架介绍
从环境设置到内存分析：Python代码优化指南
新时代创意布局不完全指南
为什么说用 Babel 编译 Typescript 是更好的选择
前端开发中这些小技巧，你知道几个？
神经网络理论基础及Python实现
JavaScript 中如何实现大文件并发上传？
Python调用C模块以及性能分析

zl程序教程

当前栏目

Java|存储|Guava Bloom Filter源码剖析

算法

Put

mightContain

参数

Compatible

相关文章