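I came across a good series of articles on building the Swift compiler. Since the Swift compiler's build is closely tied to LLVM, I took some study notes.

1. Building

Fetch the other repositories that the build needs: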

./swift/utils/update-checkout --clone

Build a version that can be debugged:

./swift/utils/build-script --release --debug-swift

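build-script is actually a Python script. It invokes another script, build-script-impl, which then uses cmake to configure the build and finally builds the Swift compiler.

1.1. Why does the Swift project use cmake?

Note: cmake generates the build configuration. It produces build files, and those build files are then used to compile and link.

cmake -G <generator> generates the build files required by different build tools.

If you would rather not drive the build tool by hand after generating, cmake --build <build_dir> lets cmake do it for you.

iOS developers generally build iOS projects with Xcode project files, so why does the Swift project manage its build with cmake?

Cross-platform portability. Xcode is a macOS application, whereas cmake can generate the build files needed on Linux, or Visual Studio files on Windows.
Easier collaboration. Xcode project files are machine-generated files full of auto-generated ID strings, so resolving merge conflicts in a team is painful. cmake uses a plain-text scripting language that is easy to read and modify.
Speed. Some cmake-compatible build tools, such as Ninja, build the project much faster than Xcode does.

I have previously written translations of two articles on Ninja and CMake; look them up if you are interested.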

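1.2. Dependencies

Building the Swift project depends on three other projects:

apple/swift-cmark: a Markdown parsing library, used to parse Markdown in documentation and comments.
apple/swift-llvm: used as the compiler back end, which supports different platforms (arm, x86, and so on). Its lit and FileCheck tools are also used to run Swift's test cases.
apple/swift-clang: used for interoperability between Swift and C and Objective-C.

After building swift-llvm, the C++ headers that the LLVM build needs still have to be supplied. They can be symlinked from Xcode: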

ln -s \
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++ \
./swift-llvm-build/include

Once these three projects are in place, you can generate the build configuration for the Swift compiler.

cmake \
-H./swift \
-B./swift-build \
-G Ninja \
-DCMAKE_BUILD_TYPE="Debug" \
-DSWIFT_PATH_TO_CMARK_SOURCE=./swift-cmark \
-DSWIFT_PATH_TO_CMARK_BUILD=./swift-cmark-build \
-DSWIFT_PATH_TO_LLVM_SOURCE=./swift-llvm \
-DSWIFT_PATH_TO_LLVM_BUILD=./swift-llvm-build \
-DSWIFT_PATH_TO_CLANG_SOURCE=./clang \
-DSWIFT_PATH_TO_CLANG_BUILD=./clang-build

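The build-script-impl shell script performs exactly these steps: configuring and building the three projects, creating the symlink, and building the Swift compiler. With everything wrapped up, users do not need to drive cmake directly; build-script is enough to complete a build. Some projects are also not configured with cmake at all, and build-script wraps those up nicely as well.

For developers, however, adding a new build option means changing cmake, build-script, and build-script-impl together, and the output of build-script --help is not clear enough either. In addition, no matter which of the three projects was modified, build-script repeats all of the steps above every time it runs, which is much slower than running cmake --build directly.

So for incremental builds, the best approach is to run the build files directly: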

cmake --build ./build --target SwiftOptions
# or, using ninja directly
ninja -C ./build/swift-macosx-x86_64

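1.3. How does cmake build the swift executable?

When cmake configures the project, swift/CMakeLists.txt is executed. It adds swift/cmake/modules to the cmake module path and includes swift/cmake/modules/AddSwift.cmake, which defines the add_swift_host_tool() function.
Next, swift/CMakeLists.txt calls add_subdirectory() on each subdirectory, including swift/tools. Every subdirectory has its own CMakeLists.txt, which recursively adds its own subdirectories. swift/tools/CMakeLists.txt does the same for swift/tools/driver.
swift/tools/driver/CMakeLists.txt contains the cmake code that describes how to build the swift executable. It calls add_swift_host_tool(), which eventually calls the built-in cmake functions add_executable() and target_link_libraries() to describe how swiftDriver and swiftFrontend are linked into the swift executable.

2. Swift Driver

The Swift Driver is where the Swift compiler starts. The main() function in driver.cpp calls into the libswiftDriver library, which splits an invocation of swift or swiftc into smaller pieces to execute. These smaller pieces are called "jobs".

swiftc -driver-print-jobs prints the jobs. They include swift -frontend and ld; producing a .swiftmodule additionally requires swift -frontend -merge-modules, and there are more. With the driver, though, all of this is a single command: swiftc -emit-module.

The Swift Driver also helps with incremental compilation: it decides which jobs need to run by checking whether their outputs are still valid or out of date.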
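To make the "valid or out of date" decision concrete, here is a minimal sketch. It is not the driver's real logic, which also tracks dependencies between files. JobInfo and needsToRun are hypothetical names, and timestamps are plain integers for simplicity.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified view of one job. Timestamps are plain integers.
struct JobInfo {
    std::vector<long long> inputModTimes; // modification time of each input
    bool outputExists;                    // false if the output was never built
    long long outputModTime;              // only meaningful if outputExists
};

// A job must run if its output is missing, or if any input is newer than
// the output. Otherwise the existing output is considered still valid.
bool needsToRun(const JobInfo &job) {
    if (!job.outputExists) return true;
    for (long long inputTime : job.inputModTimes)
        if (inputTime > job.outputModTime) return true;
    return false;
}
```

A job whose inputs were modified at times 10 and 20 and whose output was written at time 25 can be skipped; if one input changes at time 30, the job has to run again.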

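Any direct invocation of swift or swiftc without the -frontend argument goes through the driver. The driver only generates jobs and then launches swift -frontend.

2.1. swiftc is a symlink to the swift executable

swiftc is not the only such symlink; there are others as well.

These different symlinks stand for different driver modes, which accept different arguments and perform different tasks. For example, swiftc accepts -dump-ast, which is not valid for swift; swift-format handles indentation and blank lines in Swift code but does not run compilation tasks the way swiftc does.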

// swift/lib/Driver/Driver.cpp
void Driver::parseDriverKind(ArrayRef<const char *> Args) { 
// ...
	Optional<DriverKind> Kind =
	llvm::StringSwitch<Optional<DriverKind>>(DriverName)
	.Case("swift", DriverKind::Interactive)
	.Case("swiftc", DriverKind::Batch)
	.Case("swift-autolink-extract", DriverKind::AutolinkExtract)
	.Case("swift-format", DriverKind::SwiftFormat)
	.Default(None);
// ...
}

The driver mode is chosen by name, so if you symlink an arbitrary new name yourself, the arguments will not be recognized. You can still select the mode explicitly with --driver-mode=swift.
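The name-based selection and the --driver-mode= override can be sketched with the standard library alone. This is an illustration under assumptions, not the code in Driver.cpp: kindFromName and selectDriverKind are hypothetical helpers, and the Unknown enumerator stands in for the None case above.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class DriverKind { Interactive, Batch, AutolinkExtract, SwiftFormat, Unknown };

// Map an executable (or symlink) name to a driver mode, mirroring the
// StringSwitch in the excerpt.
DriverKind kindFromName(const std::string &name) {
    if (name == "swift") return DriverKind::Interactive;
    if (name == "swiftc") return DriverKind::Batch;
    if (name == "swift-autolink-extract") return DriverKind::AutolinkExtract;
    if (name == "swift-format") return DriverKind::SwiftFormat;
    return DriverKind::Unknown;
}

// Hypothetical helper: an explicit --driver-mode=<name> argument overrides
// the name the program was invoked as.
DriverKind selectDriverKind(const std::string &invokedAs,
                            const std::vector<std::string> &args) {
    const std::string prefix = "--driver-mode=";
    std::string name = invokedAs;
    for (const std::string &arg : args)
        if (arg.compare(0, prefix.size(), prefix) == 0)
            name = arg.substr(prefix.size());
    return kindFromName(name);
}
```

With this sketch, a symlink named my-swift maps to Unknown on its own, but my-swift --driver-mode=swift maps to Interactive.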

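2.2. What the Driver does

Check whether the invocation needs to be split into jobs.
If it does, instantiate swift::Driver, which decides which driver mode to enter.
In swiftc mode, swift::Driver::buildCompilation instantiates a swift::Compilation and sets up the mapping between inputs and outputs.
swift::Driver::buildActions creates a graph of swift::Action objects representing smaller units of work, such as "compile Swift code into an object file" and "link these object files into an executable". Use -driver-print-actions to inspect them.
Actions contain all the information needed to instantiate swift::Job objects; swift::Driver::buildJobs translates the actions into jobs. Use -driver-print-jobs to inspect them.
Finally, swift::Compilation::performJobs runs each job from a task queue.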
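The step from an action graph to an ordered list of jobs can be sketched as follows. This is a toy model under assumptions: the Action struct, the string-based Job, and the command lines are simplified stand-ins, and only the function name is borrowed from the real driver.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified action: it either compiles one source file or
// links the outputs of its input actions.
struct Action {
    enum Kind { Compile, Link } kind;
    std::string file;                   // source file (Compile) or executable name (Link)
    std::vector<const Action *> inputs; // actions whose outputs this action consumes
};

// In this sketch a job is just a concrete command line.
using Job = std::string;

// Translate an action into jobs. The jobs of the inputs are emitted first, so
// the resulting list is already in a valid execution order. Returns the name
// of the file the action produces.
std::string buildJobsForAction(const Action &action, std::vector<Job> &jobs) {
    if (action.kind == Action::Compile) {
        std::string object = action.file + ".o";
        jobs.push_back("swift -frontend -c " + action.file + " -o " + object);
        return object;
    }
    std::string command = "ld";
    for (const Action *input : action.inputs)
        command += " " + buildJobsForAction(*input, jobs);
    jobs.push_back(command + " -o " + action.file);
    return action.file;
}
```

For a link action with two compile actions as inputs, this yields two frontend jobs followed by one ld job, which is roughly the shape of what -driver-print-jobs prints.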

If you run swift hello, the Driver treats it as a subcommand and tries to launch an executable named swift-hello. Any invocation of the form swift <subcommand-name> is treated as a subcommand.

$ swift hello
error: unable to invoke subcommand: 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/swift-hello 
(No such file or directory)
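The subcommand rule can be sketched as a small function. This is a simplified illustration, not the driver's code: subcommandFor is a hypothetical name, and the exact conditions (the first argument does not start with '-' and contains no '.' or '/') are assumptions made for the sketch.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Returns the subcommand executable name ("swift-<name>") if the invocation
// looks like `swift <subcommand-name>`, or an empty string otherwise.
// `args` holds the arguments after the program name.
std::string subcommandFor(const std::string &execName,
                          const std::vector<std::string> &args) {
    if (execName != "swift" || args.empty()) return "";
    const std::string &first = args[0];
    // Options such as -frontend are not subcommands.
    if (first.empty() || first[0] == '-') return "";
    // Anything that looks like a file path (main.swift, ./main) is an input.
    if (first.find('.') != std::string::npos ||
        first.find('/') != std::string::npos) return "";
    return "swift-" + first;
}
```

Under these assumptions, swift hello resolves to swift-hello, which matches the executable name in the error message above, while swift main.swift and swift -frontend are handled by the normal driver path.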

After checking for a subcommand, the driver takes a different branch depending on the first argument:

int main(int argc_, const char **argv_) {
// ...checks whether to run a subcommand.
	StringRef FirstArg(argv[1]);
	if (FirstArg == "-frontend") {
		return performFrontend(llvm::makeArrayRef(argv.data()+2, argv.data()+argv.size()),
				        argv[0], (void *)(intptr_t)getExecutablePath);
	}
	if (FirstArg == "-modulewrap") {
		return modulewrap_main(llvm::makeArrayRef(argv.data()+2, argv.data()+argv.size()),
					argv[0], (void *)(intptr_t)getExecutablePath);
	}
	if (FirstArg == "-apinotes") {
		return apinotes_main(llvm::makeArrayRef(argv.data()+1,
					argv.data()+argv.size()));
	}

	Driver TheDriver(Path, ExecName, argv, Diags);
	// ...
}

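If the invocation is neither a subcommand nor one of the branches above, the Driver is instantiated, along with Swift's option table, and it decides which driver mode to run.

swift-format mode ends by calling swift_format_main. The other modes start parsing arguments with Driver::parseArgStrings, checking for problems such as incompatible options. If parsing succeeds, a swift::ToolChain object is instantiated to translate actions into jobs.

A ToolChain object turns high-level, abstract actions that describe the inputs and outputs of a Swift compilation into concrete jobs.

swift::ToolChain is an abstract base class that delegates to platform-specific subclasses, for example swift::toolchains::Darwin, which creates jobs for macOS or iOS. Argument handling that is common to all platforms stays in the base class; the subclasses override only some of the handlers.

The Driver also creates an OutputInfo object to determine which actions a command maps to. Its most important fields are CompilerOutputType and LinkAction, which decide whether to produce an executable, a library, and so on.

Driver::buildJobsForAction additionally records the dependencies between actions, which is used to improve Swift's incremental compilation.

Finally, Compilation::performJobs schedules these jobs in a task queue and executes them.

2.3. TableGen

TableGen exists to help develop and maintain records of domain-specific information. Put simply, it converts files such as Options.td and FrontendOptions.td into C-macro-like syntax (.inc files).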

include "llvm/Option/OptParser.td"

def driver_print_jobs : Flag<["-"], "driver-print-jobs">, InternalDebugOpt,
	HelpText<"Dump list of jobs to execute">;

llvm-tblgen converts .td files. For example, the record above is converted into the following:

OPTION(
	prefix_1,
	"driver-print-jobs",
	driver_print_jobs,
	Flag,
	internal_debug_Group,
	INVALID,
	nullptr,
	HelpHidden | DoesNotAffectIncrementalBuild,
	0,
	"Dump list of jobs to execute",
	nullptr,
	nullptr)

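Passing different arguments to llvm-tblgen produces different output forms.

In cmake, libswiftOption declares SwiftOptions as a dependency, which means SwiftOptions is built first. llvm-tblgen is therefore run on swift/include/swift/Option/Options.td to generate /path/to/build/swift-macosx-x86_64/include/swift/Option/Options.inc. The code then uses it through #include "Options.inc", usually preceded by a #define that gives the macros in the .inc file their meaning.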

# swift/lib/Option/CMakeLists.txt
add_swift_library(swiftOption STATIC
	# ...
	DEPENDS SwiftOptions
	# ...
)
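The "#define first, then #include the .inc file" idiom is often called an X-macro. Below is a self-contained sketch of it. Because a separate generated file cannot be included here, the SKETCH_OPTIONS macro stands in for the contents of Options.inc, and the two-argument OPTION form is a simplification of the real thirteen-field one shown earlier.

```cpp
#include <cassert>
#include <cstring>

// Stand-in for a generated Options.inc. In the real build this list lives in
// a separate file produced by llvm-tblgen.
#define SKETCH_OPTIONS \
    OPTION(driver_print_jobs, "driver-print-jobs") \
    OPTION(driver_print_actions, "driver-print-actions")

// First expansion: one enumerator per option.
enum OptionID {
#define OPTION(ID, NAME) OPT_##ID,
    SKETCH_OPTIONS
#undef OPTION
    OPT_count
};

// Second expansion of the same list: a table of option names, indexed by
// the enumerators generated above.
static const char *const OptionNames[] = {
#define OPTION(ID, NAME) NAME,
    SKETCH_OPTIONS
#undef OPTION
};
```

The point of the design is that the list of options is written once, and every consumer decides what each OPTION entry expands to by defining the macro before expanding the list.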

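As mentioned in my earlier translation of the LLVM architecture article, TableGen is also very important for describing platform-specific information in compiler back ends, such as register information.

3. Lexing & Parsing

Swift syntax trees come in untyped and typed forms, produced by -dump-parse and -dump-ast respectively. The untyped tree contains many nodes with type='<null>'; the type checker, implemented in libswiftSema, fills in the type information later.

3.1. The frontend

Roughly, the frontend does the following:

When the driver sees that the first argument is -frontend, it enters the performFrontend function of the libswiftFrontendTool library.
performFrontend parses the arguments with CompilerInvocation::parseArgs to determine FrontendOptions::RequestedAction, then instantiates CompilerInstance and ASTContext based on them. Finally it calls performCompile in libswiftFrontendTool.
performCompile uses FrontendOptions::RequestedAction to decide whether to call CompilerInstance::performParseOnly or CompilerInstance::performSema.
CompilerInstance::performSema opens a bitstream cursor into Swift.swiftmodule, which is used, for example, to determine the type of the expression print(...). It adds a SourceFile node at the root of the AST and calls parseIntoSourceFile, which instantiates a Parser and calls Parser::parseTopLevel. That function starts lexing and parsing the source text.
The Parser works together with the Lexer. The Lexer is created and stored as part of the Parser's initialization. The Lexer extracts tokens (contiguous, meaningful runs of characters) from the text, while the Parser decides what to parse and how much. For example, the Lexer lexes 'print' and stops at '(', determines whether print is a keyword, and hands the token over the next time the Parser asks for one. The Parser then asks the Lexer for the next token, because only that token, '(', tells it that this is a function call expression.
The Parser and Lexer keep looping like this, instantiating new AST nodes one after another and adding them to the ASTContext. When the end of the source file is finally reached, libswiftFrontend goes on to run the type checker on the source file.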
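The cooperation between Parser and Lexer described above, where the Lexer hands out one token at a time and the Parser asks for one more token to recognize a call, can be sketched in a few lines. This is a toy under assumptions, not Swift's Lexer or Parser: Token, Lexer, and classifyExpression are illustrative names, and the "parser" only tells a call apart from a plain reference.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

struct Token {
    enum Kind { Identifier, LParen, RParen, Other, EndOfFile } kind;
    std::string text;
};

// Produces one token at a time, only when asked.
class Lexer {
    std::string source;
    std::size_t pos = 0;

    static bool isIdentChar(char c) {
        return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
    }

public:
    explicit Lexer(std::string text) : source(std::move(text)) {}

    Token next() {
        // Skip whitespace between tokens.
        while (pos < source.size() &&
               std::isspace(static_cast<unsigned char>(source[pos])))
            ++pos;
        if (pos >= source.size()) return {Token::EndOfFile, ""};
        char c = source[pos];
        if (std::isalpha(static_cast<unsigned char>(c)) || c == '_') {
            // Consume identifier characters and stop at the first other one.
            std::size_t start = pos;
            while (pos < source.size() && isIdentChar(source[pos])) ++pos;
            return {Token::Identifier, source.substr(start, pos - start)};
        }
        ++pos;
        if (c == '(') return {Token::LParen, "("};
        if (c == ')') return {Token::RParen, ")"};
        return {Token::Other, std::string(1, c)};
    }
};

// The parser drives the lexer: after an identifier it asks for one more
// token, and only that token tells it whether this is a call expression.
std::string classifyExpression(const std::string &text) {
    Lexer lexer(text);
    Token first = lexer.next();
    if (first.kind != Token::Identifier) return "unknown";
    Token second = lexer.next();
    return second.kind == Token::LParen ? "call to " + first.text
                                        : "reference to " + first.text;
}
```

For the input print("Hello!"), the lexer first returns the identifier print, having stopped at the parenthesis, and the second request returns '(', at which point the parser knows it is looking at a call.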

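CompilerInstance is arguably one of the most important classes. It is defined in libswiftFrontend and holds unique pointers to many important single-instance objects, such as ASTContext. Its CompilerInvocation is responsible for the many option classes (FrontendOptions, LangOptions).

Swift module files (Swift.swiftmodule) are Swift ASTs serialized into a binary format known as LLVM bitstream. CompilerInstance::loadStdlib opens a cursor into the Swift standard library module file. When libswiftSema type-checks the source file and encounters, say, print, it uses that cursor to look up print.

If no definition can be found for an identifier in the current scope, an UnresolvedDeclRefExpr node is added to the AST. print, for example, is defined in the Swift standard library module, so it is unresolved at this point and needs the type checker to fill in the missing information.

3.2. LLVM Support

Argument parsing relies on libLLVMOption, while storing source text in memory, representing source locations, and emitting diagnostics (warnings, errors, and so on) rely on libLLVMSupport.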

#include "llvm/Support/MemoryBuffer.h"
#include "llvm/Support/SourceMgr.h"

int main() {
// This string will represent our source program.
	llvm::StringRef Input =
		"func foo() {\n"
		"  print(\"Hello!\")\n"
		"}";

// The llvm::MemoryBuffer class is used to store
// large strings, along with metadata such as a
// buffer or file name. Here we instantiate a
// llvm::MemoryBuffer to store the contents of
// our source program.
	std::unique_ptr<llvm::MemoryBuffer> InputBuffer =
		llvm::MemoryBuffer::getMemBuffer(Input);

// The llvm::SourceMgr class is used to emit
// diagnostics for one or more llvm::MemoryBuffer
// instances. Here we instantiate a new
// llvm::SourceMgr and transfer ownership of our
// input buffer over to it.
	llvm::SourceMgr SourceManager;
	SourceManager.AddNewSourceBuffer(
		std::move(InputBuffer),
		/*IncludeLoc*/ llvm::SMLoc());

// Here we grab a pointer into the buffer.
// Incrementing and decrementing this pointer
// allows us to traverse the source program.
	const llvm::MemoryBuffer *SourceBuffer =
		SourceManager.getMemoryBuffer(1);
	const char *CurrentCharacter =
		SourceBuffer->getBufferStart();

// The llvm::SMLoc class is used to represent a
// location in an llvm::MemoryBuffer that is managed
// by llvm::SourceMgr. We instantiate an llvm::SMLoc
// here, for the starting location.
	llvm::SMLoc BufferStartLocation =
		llvm::SMLoc::getFromPointer(CurrentCharacter);

// The llvm::SourceMgr::PrintMessage function allows
// us to print a caret at a specific llvm::SMLoc
// location.
	SourceManager.PrintMessage(
		BufferStartLocation,
		llvm::SourceMgr::DiagKind::DK_Remark,
		"This is the very beginning of the "
		"source buffer.");

	return 0;
}

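Adding an llvm::SMRange highlights a range.

An llvm::SMFixIt suggests how to fix the problem.

llvm::MemoryBuffer::getMemBuffer accepts file information (a buffer name) as its second argument, which makes the output more detailed.

llvm::SourceMgr holds a vector of memory buffers; llvm::SourceMgr::AddNewSourceBuffer adds an element to it.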
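To show what "a vector of buffers plus locations" buys you, here is a miniature, std-only imitation. It is not llvm::SourceMgr: MiniSourceMgr, addBuffer, and formatMessage are hypothetical names, locations are plain character offsets rather than llvm::SMLoc pointers, and the output format is only loosely modeled on a caret diagnostic. The sketch assumes the offset lies inside the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of a source manager: it owns a list of buffers and
// can render a message with a caret under a given character offset.
class MiniSourceMgr {
    std::vector<std::string> buffers;

public:
    // Returns a 1-based buffer ID, so that 0 can mean "no buffer".
    unsigned addBuffer(std::string text) {
        buffers.push_back(std::move(text));
        return static_cast<unsigned>(buffers.size());
    }

    // Formats "<line>:<column>: <message>", the source line, and a caret line.
    std::string formatMessage(unsigned id, std::size_t offset,
                              const std::string &message) const {
        const std::string &text = buffers.at(id - 1);
        // Walk up to the offset, tracking where the current line begins.
        std::size_t lineStart = 0, line = 1;
        for (std::size_t i = 0; i < offset && i < text.size(); ++i)
            if (text[i] == '\n') { lineStart = i + 1; ++line; }
        std::size_t lineEnd = text.find('\n', lineStart);
        if (lineEnd == std::string::npos) lineEnd = text.size();
        std::size_t column = offset - lineStart; // 0-based
        return std::to_string(line) + ":" + std::to_string(column + 1) + ": " +
               message + "\n" +
               text.substr(lineStart, lineEnd - lineStart) + "\n" +
               std::string(column, ' ') + "^\n";
    }
};
```

Given the same three-line program as in the LLVM example, pointing at the p of print (offset 15) produces a header for line 2, column 3, followed by the source line and a caret under the p.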

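3.2.1. MemoryBuffer

Reading a large file by allocating a large block of memory and read()-ing everything into it can blow up RAM usage. MemoryBuffer uses the mmap system call to avoid this.

When a file is mmap-ed into a program's address space, it is not actually read into RAM. Only when that memory is accessed are the required bytes brought into RAM, which happens through page faults. A mapped file also needs to be loaded into memory only once and can then be shared between processes. Not every file is handled with mmap, though: if a file is smaller than one page (or 16 kB), it is still read directly with new and read.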
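The "mmap large files, read small ones" policy can be sketched with POSIX calls. This is an illustration under assumptions, not MemoryBuffer's implementation: loadFile and kMmapThreshold are hypothetical names, the threshold is fixed at 16 kB, and the mapped bytes are copied into a std::string to keep ownership simple, whereas a real buffer class would keep the mapping alive so that pages are loaded lazily.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Assumed threshold for this sketch: files below 16 kB are read eagerly.
constexpr std::size_t kMmapThreshold = 16 * 1024;

// Loads a whole file into `out`. `usedMmap` reports which path was taken.
// Returns false if the file cannot be opened, mapped, or read.
bool loadFile(const std::string &path, std::string &out, bool &usedMmap) {
    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) return false;
    struct stat info;
    if (::fstat(fd, &info) != 0) { ::close(fd); return false; }
    std::size_t size = static_cast<std::size_t>(info.st_size);

    usedMmap = size >= kMmapThreshold;
    bool ok = true;
    if (usedMmap) {
        // Map the file; pages are only brought into RAM when touched.
        void *data = ::mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) {
            ok = false;
        } else {
            out.assign(static_cast<const char *>(data), size);
            ::munmap(data, size);
        }
    } else {
        // Small file: allocate once and read everything.
        out.resize(size);
        std::size_t done = 0;
        while (done < size) {
            ssize_t n = ::read(fd, &out[done], size - done);
            if (n <= 0) { ok = false; break; }
            done += static_cast<std::size_t>(n);
        }
    }
    ::close(fd);
    return ok;
}
```

The threshold reflects the trade-off described above: for tiny files, the cost of setting up a mapping outweighs a single read.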

Both Swift and Clang read files with the static member function llvm::MemoryBuffer::getFileOrSTDIN.

The file-reading code differs between Windows and Unix. This, too, is set up ahead of time by cmake, mainly through config.h.

// llvm/cmake/modules/HandleLLVMOptions.cmake
if(WIN32)
	set(LLVM_ON_WIN32 1)
	set(LLVM_ON_UNIX 0)
else(WIN32)
	if(UNIX)
		set(LLVM_ON_WIN32 0)
		set(LLVM_ON_UNIX 1)
	...
endif(WIN32)

// llvm/include/llvm/Config/config.h.cmake
/* Define to 1 if you have the `pread' function. */
#cmakedefine HAVE_PREAD ${HAVE_PREAD}
/* Define if this is Unixish platform */
#cmakedefine LLVM_ON_UNIX ${LLVM_ON_UNIX}
/* Define if this is Win32ish platform */
#cmakedefine LLVM_ON_WIN32 ${LLVM_ON_WIN32}

// build/include/llvm/Config/config.h
/* Define to 1 if you have the `pread' function. */
#define HAVE_PREAD 1
/* Define if this is Unixish platform */
#define LLVM_ON_UNIX 1

// llvm/lib/Support/Path.cpp
// Include the truly platform-specific parts.
#if defined(LLVM_ON_UNIX)
#include "Unix/Path.inc"
#endif
#if defined(LLVM_ON_WIN32)
#include "Windows/Path.inc"
#endif

References

APPLE/SWIFT Guide: https://modocache.io/

Implement #warning and #error #14048: https://github.com/apple/swift/pull/14048/files
