Transcoding billions of Unicode characters per second with SIMD instructions

@article{Lemire2021TranscodingBO,
  title={Transcoding billions of Unicode characters per second with SIMD instructions},
  author={Daniel Lemire and Wojciech Mula},
  journal={Software: Practice and Experience},
  year={2021},
  volume={52},
  pages={555--575}
}
In software, text is often represented using Unicode formats (UTF‐8 and UTF‐16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state‐of‐the‐art disks and networks. These transcoding functions make little use of the single‐instruction‐multiple‐data (SIMD) instructions available on commodity processors. By designing transcoding algorithms for SIMD instructions, we multiply the speed of transcoding on… 
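To make the notion of transcoding concrete, here is a minimal scalar sketch that converts UTF-8 bytes into UTF-16 code units one character at a time. It is not the SIMD algorithm of the paper; it only illustrates the byte-by-byte work that such algorithms aim to speed up. The function name `utf8_to_utf16_units` is hypothetical, and the sketch assumes mostly well-formed input: it rejects bad leading or continuation bytes but does not check for overlong encodings or encoded surrogates.

```python
def utf8_to_utf16_units(data: bytes) -> list:
    """Scalar sketch: transcode UTF-8 bytes into a list of UTF-16 code units."""
    units = []
    i = 0
    n = len(data)
    while i < n:
        b0 = data[i]
        # The leading byte determines the sequence length and the first payload bits.
        if b0 < 0x80:                # 0xxxxxxx: ASCII, 1 byte
            cp, length = b0, 1
        elif b0 >> 5 == 0b110:       # 110xxxxx: 2 bytes
            cp, length = b0 & 0x1F, 2
        elif b0 >> 4 == 0b1110:      # 1110xxxx: 3 bytes
            cp, length = b0 & 0x0F, 3
        elif b0 >> 3 == 0b11110:     # 11110xxx: 4 bytes
            cp, length = b0 & 0x07, 4
        else:
            raise ValueError("invalid leading byte at offset %d" % i)
        if i + length > n:
            raise ValueError("truncated sequence at offset %d" % i)
        # Each continuation byte (10xxxxxx) contributes six more payload bits.
        for b in data[i + 1:i + length]:
            if b >> 6 != 0b10:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (b & 0x3F)
        if cp < 0x10000:
            # Basic Multilingual Plane: a single 16-bit code unit.
            units.append(cp)
        else:
            # Supplementary plane: split into a high and a low surrogate.
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
        i += length
    return units
```

For example, the UTF-8 encoding of "é😀" yields the code units 0xE9, 0xD83D and 0xDE00: one unit for "é" and a surrogate pair for the emoji.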
2 Citations

Transcoding Unicode Characters with AVX-512 Instructions

Recent Intel processors include a powerful set of instructions that operate on 512-bit registers with a single instruction (AVX-512); these instructions are leveraged to transcode strings between two common formats, UTF-8 and UTF-16.

Efficient multivariate low-degree tests via interactive oracle proofs of proximity for polynomial codes

The first interactive oracle proofs of proximity (IOPP) for tensor products of Reed-Solomon codes and for Reed-Muller codes (evaluation of polynomials with bounds on individual degrees) are presented; they simultaneously achieve logarithmic query complexity, logarithmic verification time, linear oracle proof length and linear prover running time.

References

SIMD-based decoding of posting lists

This paper starts by exploring variable-length integer encoding formats used to represent postings, and defines a taxonomy that classifies encodings along three dimensions, representing the way in which data bits are stored and additional bits are used to describe the data.
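As an illustration of how a variable-length integer format separates data bits from descriptor bits, here is a minimal sketch of the classic byte-oriented scheme (often called VByte or varint): each byte carries seven data bits, and the top bit says whether another byte follows. This is a well-known format chosen for illustration, not the taxonomy or the SIMD decoder of the paper; the function names `vbyte_encode` and `vbyte_decode` are hypothetical.

```python
def vbyte_encode(values):
    """Encode non-negative integers, 7 data bits per byte, low-order bits first."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            # Top bit set: more bytes follow for this integer.
            out.append((v & 0x7F) | 0x80)
            v >>= 7
        # Top bit clear: this is the last byte of the integer.
        out.append(v)
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode back into integers."""
    values = []
    current = 0
    shift = 0
    for b in data:
        current |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7          # continuation bit set: keep accumulating
        else:
            values.append(current)
            current = 0
            shift = 0
    return values
```

With this layout, the value 300 occupies two bytes (0xAC, 0x02), while any value below 128 occupies one.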

Validating UTF‐8 in less than one instruction per byte

The lookup algorithm is presented, which outperforms UTF‐8 validation routines used in many libraries and languages by more than 10 times using commonly available single‐instruction‐multiple‐data instructions.

Faster Population Counts Using AVX2 Instructions

A vectorized approach using SIMD instructions can be twice as fast as using the dedicated instructions on recent Intel processors; it has been adopted by LLVM and is used by its popular C compiler (Clang).

Stream VByte: Faster byte-oriented integer compression

A General SIMD-Based Approach to Accelerating Compression Algorithms

By instantiating the approach, several novel integer compression algorithms are developed, called Group-Simple, Group-Scheme, Group-AFOR, and Group-PFD, and their corresponding vectorized versions are implemented.

UTF-16, an encoding of ISO 10646

The UTF-16 encoding of Unicode/ISO-10646 is described, the issues of serializing UTF-16 as an octet stream for transmission over the Internet are addressed, and MIME charset naming is discussed as described in [CHARSET-REG].

Vectorization for SIMD architectures with alignment constraints

This paper presents a compilation scheme that systematically vectorizes loops in the presence of misaligned memory references, and proposes several techniques to minimize the number of data reorganization operations generated.

Use of SIMD Vector Operations to Accelerate Application Code Performance on Low-Powered ARM and Intel Platforms

This paper considers and compares the NEON SIMD instruction set used on the ARM Cortex-A series of RISC processors with the SSE2 SIMD instruction set found on Intel platforms within the context of the Open Computer Vision (OpenCV) library.

A case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding

High-performance SIMD text processing using the method of parallel bit streams, which exploits intraregister and intrachip parallelism on multicore processors, is introduced with a case study of UTF-8 to UTF-16 transcoding.