1. IR-book 经典教材中的思路

Faster postings list intersection via skip pointers 通过跳指针，实现更快的 posting list 求交 https://nlp.stanford.edu/IR-book/html/htmledition/faster-postings-list-intersection-via-skip-pointers-1.html

首先有个最 trival 的算法，两个 posting list，各一个指针，谁不匹配，就把谁右移一个，

跳表方法，思路类似 skip list，是要在建索引的时候，额外构建 skip list，这样就不是右移一个，而是按照跳表来右移。

2. lemire 教授的 SIMDCompressionAndIntersection 库中有多种算法

https://github.com/lemire/SIMDCompressionAndIntersection/blob/master/src/intersection.cpp

可以看到实现了这些求交算法，并且还有个 benchmark 程序 benchintersection 可以做对比

“simd”
“galloping”
“scalar”
“v1”
“v3”
“simdgalloping”
“simd_avx2”
“v1_avx2”
“v3_avx2”
“simdgalloping_avx2”
“highlyscalable_intersect_SIMD”
“lemire_highlyscalable_intersect_SIMD”

其中根据输入数据的规模，会选用不同的算法，并且还有 avx2 指令集的优化算法。

其中提到的 sse 优化的 fast-intersection-sorted-lists-sse https://highlyscalable.wordpress.com/2012/06/05/fast-intersection-sorted-lists-sse/

3. “The magic of Algorithms! " 中的算法

Chap. 6, Algorithm 6.1 Intersection based on Mutual Partitioning

http://didawiki.di.unipi.it/doku.php/magistraleinformaticanetworking/ae/ae2019/start#books_notes_etc

pdf 是 https://www.dropbox.com/s/dqttqigvuf4ydkd/main.pdf?dl=0

思路很简单，就是嵌套的 binary search，实测效果不错，而且实现简单，可以直接递归实现。

我实测效果还不错，提了个 issue https://github.com/lemire/SIMDCompressionAndIntersection/pull/29

4. List Intersection for Web Search: Algorithms, Cost Models, and Optimizations

http://www.vldb.org/pvldb/vol12/p1-kim.pdf 微软的文章，搞了个 cost model，来评估各种 list 求交集算法的性能，但是不开源。

本文提出一种基于代价的模型，来估算各种算法组合的代价，并使用代价模型来搜索代价最低的算法，得到的计划通常是 2-way 算法的一个组合，并且比传统的 2-way 和 k-way 算法性能都好。

并列举了常见的 List Intersection 算法： 2-way:

2-Gallop
2-Merge
2-SIMD

k-way:

k-Gallop
k-Merge

2-way vs. k-way 本文断言，对 k-list 求交，通过 k-1 次重复 2-way 求交算法，比通过单次 k-way 求交算法，更快。

结论，通过本文 cost model optimizer 规划出来的 2-way 求交计算，通常比 k-way 求交更快。

https://arxiv.org/pdf/1401.6399.pdf

We demonstrated that combining unpacking and differential coding resulted in faster decompression speeds, which were approximately 30 % better than the best speeds reported previously [3]. To match the performance of these fast compression schemes, we additionally vectorized and optimized the intersection of posting lists. To this end, we introduced a family of algorithms exploiting commonly available SIMD instructions (V1, V3 and SIMD GALLOPING). They are often twice as fast as the best non-SIMD algorithms. Then, we used our fast SIMD routines for decompression and posting

Efficient set intersection for inverted indexing

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.415.2636&rep=rep1&type=pdf

Intersection in integer inverted indices https://pdfs.semanticscholar.org/db30/9935b512ef4d0c8941c96e987ad70400ebde.pdf

lookup 算法最快，非常简单，两层存储，按位分成两层。首层直接定位，二层直接遍历。二层可以压缩。

倒排索引求交算法相关资料调研

文章目录

1. IR-book 经典教材中的思路

2. lemire 教授的 SIMDCompressionAndIntersection 库中有多种算法

3. “The magic of Algorithms! " 中的算法

4. List Intersection for Web Search: Algorithms, Cost Models, and Optimizations