Transcoding billions of Unicode characters per second with SIMD instructions
@article{Lemire2021TranscodingBO, title={Transcoding billions of Unicode characters per second with SIMD instructions}, author={Daniel Lemire and Wojciech Mula}, journal={Software: Practice and Experience}, year={2021}, volume={52}, pages={555 - 575} }
In software, text is often represented using Unicode formats (UTF‐8 and UTF‐16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state‐of‐the‐art disks and networks. These transcoding functions make little use of the single‐instruction‐multiple‐data (SIMD) instructions available on commodity processors. By designing transcoding algorithms for SIMD instructions, we multiply the speed of transcoding on…
2 Citations
Transcoding Unicode Characters with AVX-512 Instructions
- Computer Science · ArXiv
- 2022
Intel includes on its recent processors a powerful set of instructions capable of processing 512-bit registers with a single instruction (AVX-512), which are leveraged to transcode strings between common formats: UTF-8 and UTF-16.
Efficient multivariate low-degree tests via interactive oracle proofs of proximity for polynomial codes
- Computer Science, Mathematics · Electron. Colloquium Comput. Complex.
- 2021
The first interactive oracle proofs of proximity (IOPP) for tensor products of Reed-Solomon codes and for Reed-Muller codes (evaluation of polynomials with bounds on individual degrees) are presented; they simultaneously achieve logarithmic query complexity, logarithmic verification time, linear oracle proof length and linear prover running time.
References
SIMD-based decoding of posting lists
- Computer Science · CIKM '11
- 2011
This paper starts by exploring variable-length integer encoding formats used to represent postings, and defines a taxonomy that classifies encodings along three dimensions, representing the way in which data bits are stored and additional bits are used to describe the data.
Validating UTF‐8 in less than one instruction per byte
- Computer Science · Softw. Pract. Exp.
- 2021
The lookup algorithm is presented, which outperforms UTF‐8 validation routines used in many libraries and languages by more than 10 times using commonly available single‐instruction‐multiple‐data instructions.
Faster Population Counts Using AVX2 Instructions
- Computer Science · Comput. J.
- 2018
A vectorized approach using SIMD instructions can be twice as fast as using the dedicated instructions on recent Intel processors, and has been adopted by LLVM and is used by its popular C compiler (Clang).
A General SIMD-Based Approach to Accelerating Compression Algorithms
- Computer Science · TOIS
- 2015
By instantiating the approach, several novel integer compression algorithms are developed, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and their corresponding vectorized versions are implemented.
UTF-16, an encoding of ISO 10646
- Computer Science · RFC
- 2000
The UTF-16 encoding of Unicode/ISO-10646 is described, the issues of serializing UTF-16 as an octet stream for transmission over the Internet are addressed, and MIME charset naming is discussed as described in [CHARSET-REG].
Upscaledb: Efficient integer-key compression in a key-value store using SIMD instructions
- Computer Science · Inf. Syst.
- 2017
Vectorization for SIMD architectures with alignment constraints
- Computer Science · PLDI '04
- 2004
This paper presents a compilation scheme that systematically vectorizes loops in the presence of misaligned memory references, and proposes several techniques to minimize the number of data reorganization operations generated.
Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms
- Computer Science · 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum
- 2013
This paper considers and compares the NEON SIMD instruction set used on the ARM Cortex-A series of RISC processors with the SSE2 SIMD instruction set found on Intel platforms within the context of the Open Computer Vision (OpenCV) library.
A case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding
- Computer Science · PPoPP
- 2008
High-performance SIMD text processing using the method of parallel bit streams, which draws on intraregister and intrachip parallelism on multicore processors, is introduced with a case study of UTF-8 to UTF-16 transcoding.