Saturday, August 6, 2022
Show HN: Thread-Parallel Decompression and Random Access to Gzip Files (Pragzip) https://ift.tt/gHqfWhB
Hello HN, I'm very excited to have finished a gzip decoder that can speed up decompression using threads. On my Ryzen 3900X, I measured an 8x speedup over standard gzip, reaching 1.6 GB/s for a synthetic file with a consistent compression ratio of 1.3. A fully functional parallel decompressor like this is something of a first, which is why I am excited.

Pragzip implements the two-stage decompression idea put forward by pugz, which unfortunately only works with gzipped text files, not arbitrary files, and has several other limitations. I think my main contribution over pugz is a fast (~10 MB/s) data-agnostic deflate block finder, which incidentally could also be used to rescue corrupted gzip files. Note that pigz does compress files in parallel, but it effectively cannot decompress in parallel, not even files it produced itself.

You can try out pragzip via PyPI or by building the C++ pragzip tool from source:

    python3 -m pip install --user pragzip
    pragzip --version  # 0.2.0

Here is a quick comparison with a very Huffman-intensive workload, tested on a 12-core Ryzen 3900X:

    base64 /dev/urandom | head -c $(( 4 * 1024 * 1024 * 1024 )) > 4GiB
    gzip 4GiB  # compresses to a 3.1 GiB large file called 4GiB.gz
    time gzip -d -c 4GiB.gz | wc -c          # real ~21.6 s (~200 MB/s)
    time pigz -d -c 4GiB.gz | wc -c          # real ~12.9 s (~332 MB/s)
    time pragzip -P 0 -d -c 4GiB.gz | wc -c  # real ~2.7 s (~1.6 GB/s decompression bandwidth)

I have unit tests for files produced with gzip, bgzip, igzip, pigz, Python's gzip, and python-pgzip. It should therefore work for any "normal" gzip file. It is feature-complete but needs a lot of testing and polishing. Note that it is very memory-intensive, depending on the archive's compression ratio and, of course, the number of cores being used. This will be subject to further improvements.

Bug reports, feature requests, or anything else are very welcome!

https://ift.tt/xIaEqzH

August 6, 2022 at 07:02AM
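The two-stage idea is easiest to see in the special case of multi-member gzip files (as produced by, e.g., bgzip): each member is a self-contained deflate stream, so members can be handed to worker threads independently. Here is a minimal Python sketch of that special case; it is not pragzip's actual implementation, which instead locates deflate block boundaries inside a single stream with its block finder:

```python
import gzip
from concurrent.futures import ThreadPoolExecutor

# Build a multi-member gzip stream: each member is independently
# compressed, so each can be decompressed without seeing the others.
# Finding the member/block start offsets inside an arbitrary file is
# the hard part that pragzip's data-agnostic block finder addresses.
chunks = [bytes([i]) * 65536 for i in range(8)]
members = [gzip.compress(c) for c in chunks]

# zlib releases the GIL during decompression, so a thread pool yields
# real parallelism here even in CPython.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(gzip.decompress, members))

assert b"".join(results) == b"".join(chunks)
```

The same pattern generalizes: once chunk boundaries are known, decompression becomes an embarrassingly parallel map over chunks, with the results stitched back together in order.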