The Difference of SVE2 (Project Stage 3)

Hey everyone, and welcome to the final blog for SPO600.

Objective

ARMv9 is coming out soon, and it brings an improved SIMD implementation built on a new technology called Scalable Vector Extension version 2 (SVE2). This is an upgrade from SVE, an AArch64 extension that supports flexible vector lengths.

The idea for this final blog is to analyze how ffmpeg might operate differently with the arrival of SVE2, and to describe the changes that would be needed to support SVE2.

Definition

Firstly, let's take a look at the official documentation for SVE2, taken straight from the ARM website.

"SVE2 is a superset of SVE and Neon."

This means that SVE2 supports all of the functionality of SVE and the Neon extension, plus some extended features. By this logic, ffmpeg should not strictly need any changes to keep working on hardware with SVE2. However, there are a couple of new features in SVE2 that should be utilized to improve the efficiency of ffmpeg.

How do I write SVE code anyway?

In order to understand the differences of SVE2, let's first take a look at how this vectorized code is written in the first place.

In order to write or generate SVE/SVE2 code, you are given two options:

1. Write your own assembly code using SIMD instructions.

    To program in assembly, you must know the Application Binary Interface (ABI) standard updates for SVE and SVE2. The Procedure Call Standard for Arm Architecture (AAPCS) specifies the data types and register allocations and is most relevant to programming in assembly. The AAPCS requires that:

  • Z0-Z7, P0-P3 are used for parameter and results passing.
  • Z8-Z15, P4-P15 are callee-saved registers.
  • Z16-Z31 are the corruptible registers.

2. Use instruction functions in higher level languages such as C/C++.

    These functions, often called "intrinsics", are detailed in the Arm C Language Extensions (ACLE).

    Intrinsics are simply higher-level language functions that correspond to low-level instructions, which makes this kind of code more familiar to a lot of modern developers.

    For instance, this code:

    #include <arm_sve.h>

    // svaddlb is an SVE2 intrinsic, so build with SVE2 enabled
    // (for example -march=armv8-a+sve2).
    svuint64_t uaddlb_array(svuint32_t Zs1, svuint32_t Zs2)
    {
        // widening add of even elements
        svuint64_t result = svaddlb(Zs1, Zs2);
        return result;
    }

    compiles into:

    uaddlb_array:                           // @uaddlb_array
            .cfi_startproc
    // %bb.0:
            uaddlb  z0.d, z0.s, z1.s
            ret
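If it helps to see what that single instruction is doing, here is a rough scalar sketch in plain C that runs without any SVE hardware. The function name `uaddlb_model` and the explicit element count `n32` are my own assumptions for illustration. On real hardware the number of elements comes from the vector length, not from a parameter.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of "uaddlb z0.d, z0.s, z1.s".
   Takes the even-numbered ("bottom") 32-bit elements of a and b,
   zero-extends them to 64 bits, and adds them.
   n32 is the number of 32-bit elements in each input (hypothetical parameter);
   out must have room for n32 / 2 results. */
void uaddlb_model(const uint32_t *a, const uint32_t *b, uint64_t *out, size_t n32)
{
    for (size_t i = 0; i < n32 / 2; i++) {
        /* widen first, then add, so the carry out of bit 31 is kept */
        out[i] = (uint64_t)a[2 * i] + (uint64_t)b[2 * i];
    }
}
```

The odd-numbered elements are simply skipped here. As far as I understand it, they are what the "top" variant (UADDLT) would pick up instead.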

So what's new?

According to ARM's documentation, SVE2 extends the instruction set to accelerate algorithms in the following areas:

  • Computer vision
  • Multimedia
  • Long-Term Evolution (LTE) baseband processing
  • Genomics
  • In-memory database
  • Web serving
  • General-purpose software

The entry that matters most here is multimedia: the documentation describes multimedia algorithms as becoming more efficient with this update, and that is exactly the kind of work ffmpeg does.

But what exactly is changing in SVE2?

Well, ARM claims that "SVE2 allows for more function domains in data-level parallelism". This sounds pretty technical, but it makes a lot more sense once they elaborate: "SVE2 inherits the concept, vector registers, and operation principles of SVE. SVE and SVE2 define 32 scalable vector registers."

From what I understand of "operation principles", this means that SVE2 code is written the same way as SVE code, since the programming model and the number of vector registers (32) do not change.

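There is however a major difference: the existing implementations are written for fixed-width SIMD (Neon registers are always 128 bits wide), while SVE2 is variable-width, with the hardware choosing a vector length anywhere from 128 to 2048 bits. AArch64 code will need to specify which SIMD technology is to be used.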
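To get a feel for what "variable-width" means for the shape of a loop, here is a minimal sketch in portable C that only imitates the idea. The parameter `vl_elems` is a stand-in I made up for the hardware vector length (roughly what `svcntw()` reports in ACLE), and the `pred` array imitates the predicate that `svwhilelt_b32` would produce. None of this is real SVE code, just a model of the control flow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LANES 64 /* 2048-bit maximum vector / 32-bit elements */

/* Adds b[i] into a[i] for i in [0, n), vl_elems "lanes" at a time.
   vl_elems is a hypothetical stand-in for the hardware vector length. */
void vla_add_model(uint32_t *a, const uint32_t *b, size_t n, size_t vl_elems)
{
    assert(vl_elems > 0 && vl_elems <= MAX_LANES);

    for (size_t i = 0; i < n; i += vl_elems) {
        bool pred[MAX_LANES];

        /* Model of WHILELT: a lane is active while i + lane < n. */
        for (size_t lane = 0; lane < vl_elems; lane++)
            pred[lane] = (i + lane) < n;

        /* Model of a predicated add: inactive lanes are left untouched,
           so no separate scalar tail loop is needed. */
        for (size_t lane = 0; lane < vl_elems; lane++)
            if (pred[lane])
                a[i + lane] += b[i + lane];
    }
}
```

The point of the sketch is that the same function gives the same result whether `vl_elems` is 4 or 16, which is the property a fixed 128-bit Neon loop does not have to think about.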

New Functions

In particular, ARM indicates that there are several new instructions that take over the role of existing Neon instructions. Here are two examples:

  • Transformed Neon integer operations, for example, Signed absolute difference and accumulate (SABA) and Signed halving addition (SHADD).
  • Transformed Neon widen, narrow, and pairwise operations, for example, Unsigned add long – bottom (UADDLB) and Unsigned add long – top (UADDLT).

The diagram below demonstrates how some of these operations change between the two AArch64 extensions.

[Figure: An illustration of the difference between the Neon and SVE2 processes]
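Since the mnemonics are a bit cryptic, here is a small scalar sketch of what I understand SHADD and SABA to compute per element. It uses 16-bit lanes as an arbitrary choice, and the function names and the element count `n` are hypothetical. The real instructions work on whole vectors at once.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of SHADD on 16-bit lanes: (a + b) / 2, rounded toward minus infinity.
   The sum is formed in a wider type, so it cannot overflow. */
void shadd_model(const int16_t *a, const int16_t *b, int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t sum = (int32_t)a[i] + (int32_t)b[i];
        int32_t half = sum / 2;          /* C division truncates toward zero */
        if (sum < 0 && sum % 2 != 0)
            half -= 1;                   /* match an arithmetic shift right by 1 */
        out[i] = (int16_t)half;
    }
}

/* Model of SABA on 16-bit lanes: acc += |a - b|.
   Assumption: the accumulator stays in range in this sketch. */
void saba_model(int16_t *acc, const int16_t *a, const int16_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t diff = (int32_t)a[i] - (int32_t)b[i];
        if (diff < 0)
            diff = -diff;
        acc[i] = (int16_t)(acc[i] + diff);
    }
}
```

Both of these are the sort of building blocks that show up in video code (averaging pixels, summing absolute differences), which is probably why they matter for something like ffmpeg.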

To be more specific, ARM mentions that SVE2 also adds new instructions in these two categories:

  • Complex arithmetic, for example complex integer multiply-add with rotate (CMLA).
  • Multi-precision arithmetic for large integer arithmetic and cryptography, for example, Add with carry long – bottom (ADCLB), Add with carry long – top (ADCLT), and SM4 encryption and decryption (SM4E).

New Architecture Features

SVE2 is still based on scalable vectors stored in register banks. On top of the Neon register banks, SVE and SVE2 add three new kinds of registers:

  • Scalable vector registers Z0-Z31
  • Scalable predicate registers P0-P15
  • Scalable vector system control registers ZCR_ELx

    To explain how these registers work, the following assembly syntax example is provided:
     LDFF1D {<Zt>.D}, <Pg>/Z, [<Xn|SP>, <Zm>.D, LSL #3]

    Where:

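    • Zt is the vector register being loaded, one of Z0-Z31
    • .D is the element size (64-bit doubleword): vector and predicate registers have a known element type but an unknown number of elements
    • Pg is the governing predicate register, from the predicate registers P0-P15
    • /Z selects zeroing predication, so inactive elements of Zt are set to zero
    • Zm is a vector of offsets, used for gather-scatter (vector) addressing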
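    To tie the pieces of that syntax together, here is a rough scalar model in C of a gather load with zeroing predication. The function name and the `lanes` parameter are hypothetical, and the sketch deliberately leaves out the first-fault behaviour that the "FF" in LDFF1D refers to. It only shows how Pg, Zm and the /Z qualifier interact.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 64-bit gather load with zeroing predication:
     zt[lane] = pg[lane] ? xn[zm[lane]] : 0
   Indexing a uint64_t array by zm[lane] plays the role of "LSL #3"
   (the offset is scaled by 8 bytes).
   lanes is a hypothetical stand-in for the number of 64-bit elements
   in one vector. First-fault handling is NOT modelled. */
void gather_ld1d_zeroing_model(uint64_t *zt, const bool *pg,
                               const uint64_t *xn, const uint64_t *zm,
                               size_t lanes)
{
    for (size_t lane = 0; lane < lanes; lane++) {
        if (pg[lane])
            zt[lane] = xn[zm[lane]]; /* active lane: load from base + offset */
        else
            zt[lane] = 0;            /* inactive lane: /Z writes zero */
    }
}
```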

Closing Remarks

Once again, I found this lab to be pretty tricky since I'm not very familiar with AArch64 SIMD operations yet. But this made me really curious as to how SIMD code actually works.

Sources

All information was taken from ARM's SVE2 documentation, which can be visited here.

Reflection

Lastly, I'd like to use this moment to reflect on everything I've done so far in this course.

I signed up for this course in the hope that it would help me write future code for my own projects and other courses. I had no idea what to expect, but I figured it would be low-level.

The first few weeks I did very little work, completing the bare minimum and showing up to some Thursday morning classes. Because I had a terrible sleep schedule, the early class times made it really difficult for me to be there consistently. I told myself "I'll just watch the lecture videos asynchronously", but as with most asynchronous work, the motivation just wasn't there to catch up.

Around early October, Chris decided that we had to choose an external blogging website due to the unavailability of the course wiki. This was my chance to catch up. But I procrastinated on doing this for a very long time because I felt like I had no idea what I was doing in the course.

Weeks flew by until I found myself in November with almost no course work done. I would show up to Thursday class meetings and have nothing to contribute to the team. Finally, I began writing my first blogs. Fortunately, my classmate Elena was also behind, and so we decided to meet up and work on this course together.

Learning 6502 was not easy, but through watching a lot of course videos I managed to slide by the first couple of labs. To be honest, I had no idea what I was doing for x86 and the final project, but thankfully there's no specific required code output, unlike in OOP345, so I was able to just submit a blog with my results.

This course turned out to be really complicated, but I'm glad I was able to learn low-level programming for the first time! Until next time, thanks for reading!
