Top-k Document Retrieval in Compressed Space
Document Type
Conference Proceeding
Publication Date
2025
Department
Department of Computer Science
Abstract
Let 𝓓 be a collection of D strings of total length n over an alphabet of size σ. We consider the so-called top-k document retrieval problem: given a short string P and an integer k, list the identifiers of k strings in 𝓓 most relevant to P, in decreasing order of relevance. Relevance may be a fixed value associated with the strings where P occurs, or the number of times P occurs in the strings. While RAM-optimal solutions using O(n log n) bits and O(|P|/log_σ n + k) time exist, solving the problem optimally within space close to O(n log σ) bits is open. We describe a data structure for the top-k document retrieval problem that uses O(log log n) bits per symbol on top of any compressed suffix array (CSA) of 𝓓, and supports queries in essentially optimal time, in the following sense. Given a CSA using |CSA| bits of space, that finds the suffix array range of a query string P in time t_cnt, and accesses a suffix array entry in time t_SA, listing any k pattern occurrences would take time O(t_cnt + k t_SA). Our top-k data structure uses |CSA| + O(n log log n) bits and reports k most relevant documents that contain P in time O(t_cnt + k (t_SA + log log n)). On every known CSA using O(n log σ) bits, t_SA is Ω(log log n) in virtually all cases, thus our time is O(t_cnt + k t_SA) in most situations. When the query string P is sufficiently long, some CSAs reach time O(t_cnt + k) to list any k occurrences of P. Our structure achieves similar performance in this case, obtaining time O(t_cnt + t_sort(k, n)) on every known CSA, where t_sort(k, n) is the time to sort k integers in [1, n]. This time is already O(t_cnt + k) in the typical regimes, k = O(polylog n) and k = Ω(n^ε) for any constant ε > 0. If we can deliver the results in unsorted order of relevance, then the time for long patterns is always O(t_cnt + k), which is optimal with respect to the CSA, and reaches the RAM-optimal time O(|P|/log_σ n + k) on a particular CSA. No top-k solution using o(n log D) bits of space has achieved this before.
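To make the problem statement concrete, the following is a minimal, uncompressed baseline sketch in Python, with relevance taken to be the number of occurrences of P in each string. It is not the data structure described in the abstract: it builds a plain suffix array naively, finds the suffix array range of P by binary search, and counts occurrences per document. The function names (`build_suffix_array`, `sa_range`, `top_k`) and the separator character are assumptions made for illustration only.

```python
from collections import Counter

SEP = "\x00"  # assumed separator; must not occur in any document or pattern


def build_suffix_array(text):
    # Naive construction by sorting suffixes; adequate for toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_range(text, sa, pattern):
    # Return the half-open range [lo, hi) of suffixes that start with pattern.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def top_k(docs, pattern, k):
    # Concatenate the documents, each followed by a separator, so that
    # no occurrence of the pattern can span two documents.
    text = SEP.join(docs) + SEP
    doc_of = []  # doc_of[i] = identifier of the document covering position i
    for d, s in enumerate(docs):
        doc_of.extend([d] * (len(s) + 1))
    sa = build_suffix_array(text)
    lo, hi = sa_range(text, sa, pattern)
    freq = Counter(doc_of[sa[i]] for i in range(lo, hi))
    # Decreasing frequency; ties broken by smaller document identifier.
    return sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))[:k]


docs = ["banana", "bandana", "cabana", "xyz"]
print(top_k(docs, "ana", 2))
```

This baseline inspects every occurrence in the suffix array range, so its query time grows with the number of occurrences rather than with k, and it stores the suffix array in uncompressed form; those are the two costs the bounds stated in the abstract avoid.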
Publication Title
Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)
Recommended Citation
Navarro, G., & Nekrich, Y. (2025). Top-k Document Retrieval in Compressed Space. Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 4009-4030.
http://doi.org/10.1137/1.9781611978322.137
Retrieved from: https://digitalcommons.mtu.edu/michigantech-p2/1602