Deep Learning Compiler Design

June 29, 2024 (last updated July 10, 2024)

The Compiler Design is motivated by the features the compiler needs to support. In order to have a well thought compiler design, we must first work backwards from a list of features/requirements. These requirements then inform the design decisions we must make.

Key Design components of DL compilers includes multi-level IR, front end optimizations, backend optimizations. Common design architecture of a DL compiler consists of the front end, middle end and backend. The IR is spread across the front and backend. DL Models are translated into multilevel IR.

Intermediate Representations

Intermediate Representations and choices made about IR play a key role in compiler design. For example - How many IRs before you get to assembly? What level of abstraction should an IR have. Different levels of abstractions will expose different concepts (high level vs low level)

What optimizations would you perform on the IR? based on the level of abstraction, certain optimizations are more feasible / make more sense. Good Discussion in this video https://youtu.be/Mcq5gNNDXTw?feature=shared

High Level IR

High level IR is closer to the front end and is used for hardware independent optimizations. The high-level IR is also known as graph IR, and represents the computation and control flow. The design challenge of hlir is how to abstract the computation and control flow so that it can capture and express diverse DL models. The goals of the hlir is to represent the control flow and dependency between the operators and the data, and provide an interface for graph level optimizations.

Design Decision - How to represent the Graph IR?

The representation of the graph ir influences its expressiveness and decides how the compiler will analyze the graph. Below are the main options

1. DAG - Based IR

Traditional compiler CFG representation. In DL compilers, the nodes represent the atomic DL operators. Nodes represent DL operators and edges represent the tensors.

You can apply standard compilation optimizations such as CSE (common sub-expression elimination) and DCE (dead code elimination).

DAG based IR is convenient for programming and compiling due to its simplicity, but it has deficiencies such as semantic ambiguity caused by the missing definition of computation scope

2. Let-binding-based IR

Let binding is one method to solve semantic ambiguity. Let binding based compiler figures out all results of the variables in let expression and builds a variable map. This is more efficient than DAG based because in order to find the value of a variable you have to perform recursive descent

Design Decision - How should the HLIR represent Tensor Computation?

Operators of deep learning frameworks are translated to graph ir according to the specific way it represents the computation on tensors. Representation of tensor computation can be divided into the following three categories.

Function-based - operators are encapsulated into their own functions / methods
Lambda expression - i think it’s like literally lambdas (in line functions)
Einstein notation (summation) - ???

TODO describe the pros and cons of each

Design Decision: Implementing the HLIR — How should the HLIR represent Tensor Data?

1. Placeholders

A placeholder is a variable with explicitly defined shape information (ie, the size of the dimensions), but the values will be populated at later stages of the computation. In other words, we expect data of a certain size/shape but don’t know what it is yet.

This allows the engineer to describe the operations and build the computation graph without worrying of the exact content of the data. This helps separation of concerns — separate the computation definition from the execution in the compiler

2. Dynamic Shape Representation — Can placeholder sizes be unknown?

Most compilers support an unknown size, and this is needed to support dynamic models. TRADEOFF — to fully support dynamic models, the bound inference and dimension checking should be relaxed, and an extra mechanism is needed to guarantee memory validity.

Note

This raises another design decision / feature requirement — Does the compiler support Dynamic Models?

3. Data Layout

Data layout describes how a tensor is organized in memory and It is usually a mapping from a logical indices to memory indices. However, combining the data layout with the operators rather than tensors enables intuitive implantation of certain operators and redues the compilation overhead.

4. Bound Inference

Design Decision: Implementing the HLIR — What operators should the HLIR support?

The Operators supported by DL compilers are responsible for representing the DL workloads. They are the nodes of the computation graph. They usually include algebraic operators ( $+$ , $\times$ , exp and topK etc,) neural network operators (convolution, pooling), tensor operators (reshape, resize, copy), broadcast and reduction operators(ex- min, argmin) and control flow operators (ex- conditional and loop)

Broadcast: Broadcast operators can replicate the data and generate new data with compatible shape. without broadcast operators the input tensor shapes are more constrained

Control Flow: needed when representing complex and flexible models. Without supporting control flow in graph IR of DL compilers, these models have to rely on the control flow support of the host languages which deteriorates the computation efficiency.

Derivative: an operator that calculates the gradient (derivative) of a given op.

Customized operators: allows programmers to define their own operators for a particular purpose.

Discussion

DL compilers share similar design philosophy for their high level IRs.

DAG and let binding for computation graph
convenient ways to represent tensor computation
data and operators in HLIR are flexible and extensible to support different DL models
Hardware independent

Low Level IR

low-level IR is designed for hardware specific optimization and code generation on diverse hardware targets. Thus the low-level IR should be fine-grained enough to reflect the hardware characteristics and represent the hardware-specific optimizations. It should also allow the use of mature 3rd party tools such as Halide and LLVM

Implementation

Describes the computation of a DL model in a more fine grained way which enables target dependent optimizations by providing interfaces to tune the computation and memory access

MLIR

((this is to detailed skippinh))

Frontend Optimizations

Many Optimizations are easier to identify and perform at the graph level because the graph provides a global view of the computation. Frontend optimizations are usually defined by passes and can be applied by traversing the nodes of the computation graph and performing graph transformations. Developers can also define customized passes.

hardware independent optimizations can be classified into three categories:

node-level optimizations
block-level (peephole, local) optimizations
dataflow-level (global) optimizations

Node Level Optimizations

The nodes of the computation graph are “coarse” enough to enable optimizations inside a single node ((is it like a basic block?)). Node optimizations include node elimination and node replacement that replaces nodes with other lower-cost nodes.

In DL compilers Nop elimination is responsible for eliminating the operations lacking adequate inputs. (ex- the sum node with only one input tensor can be eliminated)

Block Level Optimizations

Algebraic Simplification

Algebraic identification
strength reduction
constant folding

Operator Fusion Enables better sharing of computation, eliminates intermediate allocations, facilitates further optimization by combining loop nests

Operator Sinking - ??

Dataflow-level optimizations

Common sub-expression elimination (CSE)

Dead code elimination

Static memory planning

Layout transformation

Backend Optimizations

Hardware-specific Optimizations

Loop Oriented Optimizations

Loop Fusion — fuse loops with the same boundaries for better data reuse
Sliding windows — its central concept is to compute values when needed and store them on the fly for data reuse until they are no longer required. As sliding windows interleaves the computation of 2 loops and make them serial, its a trade off between parallelism and data reuse
Tiling — splits loops into several tiles and thus loops are divided into outer loops iterating through tiles and inner loops iterating inside a tile. the transformation enables better data locality by fitting a tile into hardware caches.
Loop reordering — (AKA loop permutation) changes the order of iterations in a nested loop, which can optimize the memory access and thus increase spatial locality
Loop unrolling — Loop unrolling can unroll a specic loop to a xed number of copies of loop bodies, which allows the compilers to apply aggressive instruction-level parallelism. Usually, loop unrolling is applied in combination with loop split, which rst splits the loop into two nested loops and then unrolls the inner loop completely.

Deep Learning Compiler Design

Intermediate Representations

High Level IR

Design Decision - How to represent the Graph IR?

1. DAG - Based IR

2. Let-binding-based IR

Design Decision - How should the HLIR represent Tensor Computation?

Design Decision: Implementing the HLIR — How should the HLIR represent Tensor Data?

1. Placeholders

2. Dynamic Shape Representation — Can placeholder sizes be unknown?

3. Data Layout

4. Bound Inference

Design Decision: Implementing the HLIR — What operators should the HLIR support?

Discussion

Low Level IR

Implementation

Frontend Optimizations

Node Level Optimizations

Block Level Optimizations

Dataflow-level optimizations

Backend Optimizations

Hardware-specific Optimizations

Optimized Kernel Libraries