zl程序教程

您现在的位置是:首页 >  Java

当前栏目

Java|存储|Guava Bloom Filter源码剖析

2023-03-15 22:02:22 时间

Bloom Filter(布隆过滤器)以牺牲少量正确率为代价,利用较少的空间实现O(1)的查询,在LSM Tree、Cache中作为常见的读优化手段。本文结合谷歌的Guava源码介绍Bloom Filter的实现。

算法

我们可以回忆一下bitmap是如何进行查找的,本质上,bitmap就是对于元素进行了

的哈希映射,利用一个bit判断元素是否位于集合中。

布隆过滤器实质上以增加时间开销为代价,节约空间。就是通过多个哈希映射,

,利用一组bit判断元素是否位于集合中。

因为允许哈希冲突,所以布隆过滤器会有一定的误判率。

Put

public <T> boolean put(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);
      boolean bitsChanged = false;
      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        bitsChanged |= bits.set(combinedHash % bitSize);
      }
      return bitsChanged;
    }

Less Hashing, Same Performance: Building a Better Bloom Filter

谷歌使用了哈佛这篇论文的结论,利用两个32位的hash函数即可映射到n个bit上。

首先将object通过funnel转换为基本类型,计算出64位hash,并且将高位低位分别作为hash。这样一来只调用了一次hash函数,大大节约了时间开销。

Hint: 联想一下算法导论散列表那一节,对于开放地址法,算法导论提出了线性探查、二次探查、双重散列等几种探查方法,以双重散列为冲突最小,这里的思路其实就是双重散列!联想一下上面提到的因为hash冲突而导致误判,那么减少冲突其实是自然而然的。

那么我们发现,本质上,这个布隆过滤器其实就是向散列表中插入了numHashFunctions个相同的对象,而散列表恰好采用了开放地址法+双重散列罢了。

mightContain

 @Override
    public <T> boolean mightContain(
        T object, Funnel<? super T> funnel, int numHashFunctions, LockFreeBitArray bits) {
      long bitSize = bits.bitSize();
      long hash64 = Hashing.murmur3_128().hashObject(object, funnel).asLong();
      int hash1 = (int) hash64;
      int hash2 = (int) (hash64 >>> 32);

      for (int i = 1; i <= numHashFunctions; i++) {
        int combinedHash = hash1 + (i * hash2);
        // Flip all the bits if it's negative (guaranteed positive number)
        if (combinedHash < 0) {
          combinedHash = ~combinedHash;
        }
        if (!bits.get(combinedHash % bitSize)) {
          return false;
        }
      }
      return true;
    }

思路和put是一样的,如果任意一个映射的bit为0,则元素miss,只有全部为1才认为元素可能存在。但是因为存在hash冲突,所以并不是100%存在。

参数

参数能够自动根据需求计算

hash数计算

给定元素数目和布隆过滤器bit数目,计算布隆过滤器hash函数数目。

  // Cheat sheet:
  //
  // m: total bits
  // n: expected insertions
  // b: m/n, bits per insertion
  // p: expected false positive probability
  //
  // 1) Optimal k = b * ln2
  // 2) p = (1 - e ^ (-kn/m))^k
  // 3) For optimal k: p = 2 ^ (-k) ~= 0.6185^b
  // 4) For optimal k: m = -nlnp / ((ln2) ^ 2)
  /**
   * Computes the optimal k (number of hashes per element inserted in Bloom filter), given the
   * expected insertions and total number of bits in the Bloom filter.
   *
   * <p>See http://en.wikipedia.org/wiki/File:Bloom_filter_fp_probability.svg for the formula.
   *
   * @param n expected insertions (must be positive)
   * @param m total number of bits in Bloom filter (must be positive)
   */
  @VisibleForTesting
  static int optimalNumOfHashFunctions(long n, long m) {
    // (m / n) * log(2), but avoid truncation due to division!
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

bit数计算

给定元素数目和误判率,计算布隆过滤器bit数目。

/**
   * Computes m (total bits of Bloom filter) which is expected to achieve, for the specified
   * expected insertions, the required false positive probability.
   *
   * <p>See http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives for the
   * formula.
   *
   * @param n expected insertions (must be positive)
   * @param p false positive rate (must be 0 < p < 1)
   */
  @VisibleForTesting
  static long optimalNumOfBits(long n, double p) {
    if (p == 0) {
      p = Double.MIN_VALUE;
    }
    return (long) (-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

Compatible

对于所有参数都相同的布隆过滤器,可以进行位运算。使得两个布隆过滤器的bit数组取并集,这样就能结合两个集合的信息了。

 public boolean isCompatible(BloomFilter<T> that) {
    checkNotNull(that);
    return this != that
        && this.numHashFunctions == that.numHashFunctions
        && this.bitSize() == that.bitSize()
        && this.strategy.equals(that.strategy)
        && this.funnel.equals(that.funnel);
  }
  public void putAll(BloomFilter<T> that) {
    checkNotNull(that);
    checkArgument(this != that, "Cannot combine a BloomFilter with itself.");
    checkArgument(
        this.numHashFunctions == that.numHashFunctions,
        "BloomFilters must have the same number of hash functions (%s != %s)",
        this.numHashFunctions,
        that.numHashFunctions);
    checkArgument(
        this.bitSize() == that.bitSize(),
        "BloomFilters must have the same size underlying bit arrays (%s != %s)",
        this.bitSize(),
        that.bitSize());
    checkArgument(
        this.strategy.equals(that.strategy),
        "BloomFilters must have equal strategies (%s != %s)",
        this.strategy,
        that.strategy);
    checkArgument(
        this.funnel.equals(that.funnel),
        "BloomFilters must have equal funnels (%s != %s)",
        this.funnel,
        that.funnel);
    this.bits.putAll(that.bits);
  }

剩下的大部分和序列化有关,无视之。