1. IR-book 经典教材中的思路

Faster postings list intersection via skip pointers 通过跳指针,实现更快的 posting list 求交 https://nlp.stanford.edu/IR-book/html/htmledition/faster-postings-list-intersection-via-skip-pointers-1.html

首先有个最 trival 的算法,两个 posting list,各一个指针,谁不匹配,就把谁右移一个,

跳表方法,思路类似 skip list,是要在建索引的时候,额外构建 skip list,这样就不是右移一个,而是按照跳表来右移。

2. lemire 教授的 SIMDCompressionAndIntersection 库中有多种算法

https://github.com/lemire/SIMDCompressionAndIntersection/blob/master/src/intersection.cpp

可以看到实现了这些求交算法,并且还有个 benchmark 程序 benchintersection 可以做对比

  • “simd”
  • “galloping”
  • “scalar”
  • “v1”
  • “v3”
  • “simdgalloping”
  • “simd_avx2”
  • “v1_avx2”
  • “v3_avx2”
  • “simdgalloping_avx2”
  • “highlyscalable_intersect_SIMD”
  • “lemire_highlyscalable_intersect_SIMD”

其中根据输入数据的规模,会选用不同的算法,并且还有 avx2 指令集的优化算法。

其中提到的 sse 优化的 fast-intersection-sorted-lists-sse https://highlyscalable.wordpress.com/2012/06/05/fast-intersection-sorted-lists-sse/

3. “The magic of Algorithms! " 中的算法

Chap. 6, Algorithm 6.1 Intersection based on Mutual Partitioning

http://didawiki.di.unipi.it/doku.php/magistraleinformaticanetworking/ae/ae2019/start#books_notes_etc

pdf 是 https://www.dropbox.com/s/dqttqigvuf4ydkd/main.pdf?dl=0

思路很简单,就是嵌套的 binary search,实测效果不错,而且实现简单,可以直接递归实现。

我实测效果还不错,提了个 issue https://github.com/lemire/SIMDCompressionAndIntersection/pull/29

4. List Intersection for Web Search: Algorithms, Cost Models, and Optimizations

http://www.vldb.org/pvldb/vol12/p1-kim.pdf 微软的文章,搞了个 cost model, 来评估各种 list 求交集算法的性能,但是不开源。

本文提出一种基于代价的模型,来估算各种算法组合的代价,并使用代价模型来搜索代价最低的算法,得到的计划通常是 2-way 算法的一个组合, 并且比传统的 2-way 和 k-way 算法性能都好。

并列举了常见的 List Intersection 算法: 2-way:

  1. 2-Gallop
  2. 2-Merge
  3. 2-SIMD

k-way:

  1. k-Gallop
  2. k-Merge

2-way vs. k-way 本文断言,对 k-list 求交,通过 k-1 次重复 2-way 求交算法,比通过单次 k-way 求交算法,更快。

结论,通过本文 cost model optimizer 规划出来的 2-way 求交计算,通常比 k-way 求交更快。

https://arxiv.org/pdf/1401.6399.pdf

We demonstrated that combining unpacking and differential coding resulted in faster decompression speeds, which were approximately 30 % better than the best speeds reported previously [3]. To match the performance of these fast compression schemes, we additionally vectorized and optimized the intersection of posting lists. To this end, we introduced a family of algorithms exploiting commonly available SIMD instructions (V1, V3 and SIMD GALLOPING). They are often twice as fast as the best non-SIMD algorithms. Then, we used our fast SIMD routines for decompression and posting

Efficient set intersection for inverted indexing

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.415.2636&rep=rep1&type=pdf

Intersection in integer inverted indices https://pdfs.semanticscholar.org/db30/9935b512ef4d0c8941c96e987ad70400ebde.pdf

lookup 算法最快,非常简单,两层存储,按位分成两层。 首层直接定位,二层直接遍历。 二层可以压缩。