Distinguished Papers

Synthesis of Sorting Kernels

Tensorize: Fast Synthesis of Tensor Programs from Legacy Code using Symbolic Tracing, Sketching and Solving

Enhancing Deployment-Time Predictive Model Robustness for Code Analysis and Optimization

Optimization and Transformation

SySTeC: A Symmetric Sparse Tensor Compiler

Pattern Matching in AI Compilers and Its Formalization

Scalar Interpolation: A Better Balance between Vector and Scalar Execution for SuperScalar Architectures

PreFix: Optimizing the Performance of Heap-Intensive Applications

A Priori Loop Nest Normalization: Automatic Loop Scheduling in Complex Applications

An Efficient Polynomial Multiplication Derived Implementation of Convolution in Neural Networks

Postie: Extending Post-increment Addressing for Loop Optimization and Code Size Reduction

Towards Efficient Compiler Auto-tuning: Leveraging Synergistic Search Spaces

Stardust: Compiling Sparse Tensor Algebra to a Reconfigurable Dataflow Architecture

Vectron: A Dynamic Programming Auto-vectorization Framework

Machine Learning Tools and Their Optimization

VEGA: Automatically Generating Compiler Backends using a Pre-trained Transformer Model

IntelliGen: Instruction-Level Auto-tuning for Tensor Program with Monotonic Memory Optimization

GraalNN: Context-Sensitive Static Profiling with Graph Neural Networks

TL;DR

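This paper presents a framework that uses a GNN to enrich the static information available on the compiler's intermediate representation. It uses the call graph to decide which methods to inline, and improves performance by 3.7% on real-world workloads.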

Details

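Background:

  • PGO and BOLT already deliver substantial performance gains
  • PGO is limited by its reliance on dynamic profiles (the paper cites references for this)
  • PGO is constrained by dynamic input/output pairs; projects such as MySQL do not adopt PGO for this reason
  • Static analysis relies on heuristics and developer intuition, and falls short of the newer ML models
  • The existing SOTA model is based on XGBoost, but it
    • does not make good use of the IR
    • does not make good use of calling-context information
    • is confined to a single function, with no broader calling context available for training (in plain terms: it works at the Function level only, not the Module level)

Contributions:

  1. GNN-based static prediction for GraalVM
  2. Less manual, heuristic-driven intervention
  3. Context-sensitive techniques for static analysis
  4. Better code size and run time than the existing SOTA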

Frankly, its only downstream use is inlining; the paper does not mention any more ambitious scenario, such as branch prediction.

The authors aim to perform high-quality static analysis with relatively little information, so:

  • Input: CFGs and features extracted from them, such as fixed-node information, floating-node information, CPU cycles, and assembly size.
  • ⑥论文/Proceedings/CGO2025/image.webp
    • An edge matrix is also provided as input, indicating who calls whom
    • Branch prediction is performed as well
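
The combination of per-node features and an edge matrix described above is the usual input to a graph neural network. The following is a minimal sketch of one message-passing round, not GraalNN's actual architecture: the feature layout (`[cpu_cycles, assembly_size]`), the mean aggregation, and the function name `message_passing_round` are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def message_passing_round(features: List[Vector],
                          adjacency: List[List[int]]) -> List[Vector]:
    """One round of mean-aggregation message passing.

    adjacency[i][j] == 1 means node i has an edge to node j
    (for example, i calls j). Each node's new vector is the average of
    its own vector and the mean of its successors' vectors. Nodes with
    no successors keep their own vector unchanged.
    """
    updated = []
    for i, own in enumerate(features):
        neighbours = [features[j]
                      for j, edge in enumerate(adjacency[i]) if edge]
        if not neighbours:
            updated.append(list(own))
            continue
        # Column-wise mean over all successor feature vectors.
        mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
        updated.append([(a + b) / 2 for a, b in zip(own, mean)])
    return updated


# Hypothetical per-node features: [estimated CPU cycles, assembly size]
features = [[10.0, 40.0], [2.0, 8.0], [6.0, 16.0]]
# Node 0 has edges to nodes 1 and 2; nodes 1 and 2 have no outgoing edges.
adjacency = [
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
]
result = message_passing_round(features, adjacency)
```

After one round, node 0's vector mixes in information from the nodes it points to, which is the mechanism that lets a graph model see beyond a single function. A real GNN would apply learned weights and a nonlinearity at each round; those are omitted here.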

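Overall, this paper is not worth reading in depth: it is evaluated on Graal rather than on a general-purpose compiler. Moreover, Sections 2 and 3 are devoted entirely to training, inference, and the inputs and outputs, which makes them very engineering-heavy. As a matter of technical practice, however, the paper can serve as reference material, since the technique itself is general.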

LLM-Vectorizer: LLM-Based Verified Loop Vectorizer

Architecture and Code Generation

Calibro: Compilation-Assisted Linking-Time Binary Code Outlining for Code Size Reduction in Android Applications

TL;DR

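To reduce the code size of OAT files on mobile devices, the authors build a framework that works at compile time and link time to extract identical code, reducing code size by 15% on average.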

Details

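  • Google Play requires an app's compressed code size to be under 200 MB
  • Large apps discourage users from downloading them
  • Compared with LLVM and GCC, the Android compiler pays little attention to code-size optimization: on average, 25% of the code is redundant.
  • Existing work targets only LLVM and iOS, not Android.
  • A side effect of code outlining is reduced runtime performance
  • The authors collect key information that can assist optimization at link time.
  • The authors use a parallel suffix tree to reduce execution time (though this is probably a rehash of existing ideas)
  • Performance drops by only 1%
  • Background of the paper:
    • Code Outlining
    • Code Redundancy Detection: uses a suffix tree to search for repeated fragments
    • Code Redundancy Elimination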
  • ⑥论文/Proceedings/CGO2025/image 1.webp
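  • The suffix tree indicates how many times a given fragment occurs
  • Most repeated fragments are small, and they mostly come from the Android runtime
  • The authors propose using a cache to store repeated fragments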
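
The detection step above can be illustrated with a small sketch. It deliberately swaps the suffix tree for brute-force counting of fixed-length windows, which is far slower but easier to follow, and it is not Calibro's implementation. The cost model in `outlining_saving` (one call instruction per occurrence, one return instruction in the outlined function) and the instruction mnemonics are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Fragment = Tuple[str, ...]


def repeated_fragments(code: Sequence[str], min_len: int = 2,
                       max_len: int = 4) -> Dict[Fragment, List[int]]:
    """Map each repeated fragment to the start offsets of its
    non-overlapping occurrences (brute force, not a suffix tree)."""
    positions: Dict[Fragment, List[int]] = defaultdict(list)
    for length in range(min_len, max_len + 1):
        for start in range(len(code) - length + 1):
            fragment = tuple(code[start:start + length])
            starts = positions[fragment]
            # Skip an occurrence that overlaps the previously kept one.
            if not starts or start >= starts[-1] + length:
                starts.append(start)
    return {f: s for f, s in positions.items() if len(s) >= 2}


def outlining_saving(length: int, occurrences: int) -> int:
    """Instructions saved under the assumed cost model: every occurrence
    becomes one call, and the outlined body gains one return."""
    before = length * occurrences
    after = occurrences + length + 1
    return before - after


# Hypothetical instruction stream with one fragment repeated three times.
code = ["ldr", "add", "str", "mov",
        "ldr", "add", "str", "cmp",
        "ldr", "add", "str"]
fragments = repeated_fragments(code)
best = max(fragments.items(),
           key=lambda item: outlining_saving(len(item[0]), len(item[1])))
```

Under this cost model, the three-instruction fragment saves two instructions, while the two-instruction fragments inside it save nothing. This matches the observation that most repeated fragments are small: short fragments pay off only when they occur many times, and every outlined call adds runtime overhead.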

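Comments:
The authors spend a lot of time introducing suffix trees, which are not difficult material.
The authors run the code-redundancy experiment first to show that code size can be optimized; this is a common experimental approach in architecture research.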

A Multi-level Compiler Backend for Accelerated Micro-kernels Targeting RISC-V ISA Extensions

xDSL: Sidekick Compilation for SSA-Based Compilers

Machine Learning Compilers

ANT-ACE: An FHE Compiler Framework for Automating Neural Network Inference

CUrator: An Efficient LLM Execution Engine with Optimized Integration of CUDA Libraries

Accelerating LLMs using an Efficient GEMM Library and Target-Aware Optimizations on Real-World PIM Devices

MLIR

The MLIR Transform Dialect: Your Compiler Is More Powerful Than You Think

Combining MLIR Dialects with Domain-Specific Architecture for Efficient Regular Expression Matching

DialEgg: Dialect-Agnostic MLIR Optimizer using Equality Saturation with Egglog

Quantum Computing

Synthesis of Quantum Simulators by Compilation

Weaver: A Retargetable Compiler Framework for FPQA Quantum Architectures

ASDF: A Compiler for Qwerty, a Basis-Oriented Quantum Programming Language

Qubit Movement-Optimized Program Generation on Zoned Neutral Atom Processors

Program Analysis and Synthesis

Automatic Synthesis of Specialized Hash Functions

Stack Filtering: Elevating Precision and Efficiency in Rust Pointer Analysis

SkipFlow: Improving the Precision of Points-to Analysis using Primitive Values and Predicate Edges

Security and Resilience

FastFlip: Compositional SDC Resiliency Analysis

MTE4JNI: A Memory Tagging Method to Protect Java Heap Memory from Illicit Native Code Access

Memory Safety Instrumentations in Practice: Usability, Performance, and Security Guarantees

GPUs and Parallelism

Code Generation for Cryptographic Kernels using Multi-word Modular Arithmetic on GPU

CuAsmRL: Optimizing GPU SASS Schedules via Deep Reinforcement Learning

Proteus: Portable Runtime Optimization of GPU Kernel Execution with Just-in-Time Compilation

Security, Fault Tolerance, and Cryptography

Qiwu: Exploiting Ciphertext-Level SIMD Parallelism in Homomorphic Encryption Programs

Cage: Hardware-Accelerated Safe WebAssembly

Teapot: Efficiently Uncovering Spectre Gadgets in COTS Binaries

Janitizer: Rethinking Binary Tools for Practical and Comprehensive Security

Parallaft: Runtime-Based CPU Fault Tolerance via Heterogeneous Parallelism

Runtime and System Tools

Honey Potion: An eBPF Backend for Elixir

GoFree: Reducing Garbage Collection via Compiler-Inserted Freeing

Improving Write-Heavy Startup Performance

Speeding up the Local C++ Development Cycle with Header Substitution