In a previous blog post I gave a general introduction to GPU driver internals in Android/Linux systems. Following up on it, today I will explain how one specific functionality, hardware performance counter (perf counter) queries, is handled in both the Qualcomm Adreno and ARM Mali drivers, by walking through the kernel driver source code.
Rationale: A Perf Counter Sampling Library
But why look into perf counters? What’s interesting about them? Perf counters are special processor registers that expose various metrics about the processor. They are crucial for understanding software performance issues and making the best use of the hardware.
But it also comes down to a library that I’ve been working on recently. So please allow me to digress a bit here.
Unlike CPUs, GPUs come in many different architectures from quite a few hardware vendors. To understand the fine performance details, we typically need to resort to vendor-specific tools like AMD Radeon GPU Profiler, ARM Mobile Studio, NVIDIA Nsight, and Qualcomm Snapdragon Profiler. They are all-encompassing tool suites that provide many utilities and aim to profile the whole system. If you only care about one vendor or one GPU architecture and would just like an integrated solution, they are great. But if you need to support multiple vendors and/or have your own profiling solution in the development flow, e.g., for better automation, they are not so great: managing an IDE-like GUI application for each vendor is not really fun.
So I have been working on a lightweight and embeddable library for sampling GPU perf counters as an alternative solution. It should support multiple vendors. For performance (as we normally sample perf counters at a very high frequency), the plan is to interact directly with the GPU kernel driver. This is inspired by HWCPipe, which is a great resource showing how it is done for ARM Mali GPUs. However, some of its design choices (e.g., mandatory C++ features like STL and exceptions) render it unsuitable for my needs. The biggest issue, though, is that it does not support other vendors. That motivated me to write my own.
Anyway, I should probably write another blog post once the library is ready. Now switching back to the main topic for today: perf counters in drivers.
Methodology
We will be looking at perf counters for both Qualcomm Adreno and ARM Mali GPUs. This requires reading the kernel driver source code, which we cannot find in the upstream Linux kernel source tree. Instead, it is released by the Android OEMs shipping products with these GPUs.
Here, I’ll use the code released by Samsung for their Galaxy S21 series. Depending on the market, the Galaxy S21 contains either the Snapdragon 888 (e.g., devices with model number SM-G991U) or the Exynos 2100 (e.g., devices with model number SM-G991B) SoC. I downloaded the kernel code from Samsung’s open source website and put it on GitHub (SM-G991U, SM-G991B) so that I can grab links to use in this post.
As explained in my previous blog post, GPU drivers use common frameworks and have similar structure in their implementation. That gives us anchors to read the source code. For example, we can search for module_init in the driver’s directory to find out the entry point for the whole module. Similarly, platform_driver_register’s argument defines the driver’s major traits, including the name, which hardware device to match, and so on.
With all of the above, we can look into each GPU now.
Adreno GPU
First let’s have some fun reading the kernel code.
Kernel code walkthrough
The kernel driver for Adreno GPUs is called KGSL, short for Kernel Graphics Support Layer. It is written as a loadable kernel module and uses the platform driver framework. So the anchors mentioned in the above section work; grepping for them shows that the drivers/gpu/msm/adreno.c file is the main file [1] pulling everything together and containing various function pointers.
Device/driver information
Within it, adreno_platform_driver is the struct defining the driver:
static struct platform_driver adreno_platform_driver = {
    .probe = adreno_probe,
    .remove = adreno_remove,
    .driver = {
        .name = "kgsl-3d",
        .pm = &adreno_pm_ops,
        .of_match_table = of_match_ptr(adreno_match_table),
    }
};
That’s where the driver name, kgsl-3d, comes from; and it’s matching against devices specified in the adreno_match_table [2]:
static const struct of_device_id adreno_match_table[] = {
    { .compatible = "qcom,kgsl-3d0", .data = &device_3d0 },
    { },
};
The .compatible field is the interesting one; it follows the <vendor>,<device> format. So the driver is compatible with devices from vendor qcom and with the name kgsl-3d0. Such devices can be bound to and managed by this driver.
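To make the <vendor>,<device> convention concrete, here is a small userspace sketch that splits a compatible string at the first comma. The helper name split_compatible is made up for illustration; it is not a kernel API, and the kernel matches these strings with its own device tree helpers rather than anything like this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a kernel API): split "<vendor>,<device>" at the
 * first comma. Returns 0 on success, -1 if there is no comma or if a part
 * does not fit into the caller's buffer. */
static int split_compatible(const char *compatible,
                            char *vendor, size_t vendor_size,
                            char *device, size_t device_size)
{
    assert(compatible && vendor && device);

    const char *comma = strchr(compatible, ',');
    if (!comma)
        return -1;

    size_t vendor_len = (size_t)(comma - compatible);
    size_t device_len = strlen(comma + 1);
    if (vendor_len >= vendor_size || device_len >= device_size)
        return -1;

    memcpy(vendor, compatible, vendor_len);
    vendor[vendor_len] = '\0';
    memcpy(device, comma + 1, device_len + 1); /* copies the terminator too */
    return 0;
}
```

Feeding it "qcom,kgsl-3d0" yields the vendor qcom and the device kgsl-3d0, which are the two parts discussed above.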
Thus far this is all pretty straightforward stuff; but I just wanted to point it out so we are on a solid footing regarding the device/driver names.
Details about the device, if you want to understand more, can be found by searching for kgsl-3d0 in the kernel codebase, because devices need to expose that name in the compatible field in their Device Tree Source files. For example, Adreno 660 is defined in the arch/arm64/boot/dts/vendor/qcom/lahaina-gpu.dtsi file [3]. The doc for the fields in it can be found in the arch/arm64/boot/dts/vendor/bindings/gpu/adreno.txt file. The low-level details there are unlikely to be useful to us; but the power levels can be interesting, as they define the frequencies we can see for the GPU.
Let’s continue to look at device driver binding, which is done in the adreno_probe function. That shows that the driver is actually an aggregate driver; it uses component helpers to pull in components like the Graphics Management Unit. Anyway, eventually it calls the adreno_bind function, which in turn calls the GPU core specific probe function:
static int adreno_bind(struct device *dev)
{
    struct platform_device *pdev = to_platform_device(dev);
    const struct adreno_gpu_core *gpucore;
    // ...
    return gpucore->gpudev->probe(pdev, chipid, gpucore);
}
GPU core definition
We are approaching the meaty definitions, starting with the adreno_gpu_core struct:
struct adreno_gpu_core {
    enum adreno_gpurev gpurev;
    unsigned int core, major, minor, patchid;
    const char *compatible;
    unsigned long features;
    struct adreno_gpudev *gpudev;
    const struct adreno_perfcounters *perfcounters;
    // ...
};
Yes, perfcounters! But before looking into that, also worth noting is the adreno_gpudev struct inside. It’s a huge struct containing GPU core specific function pointers, including the probe function mentioned earlier.
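The pattern here, a table of core descriptions where each entry points at a struct of core-specific function pointers, can be sketched in plain userspace C. Everything below is a simplified stand-in: the struct names, the table contents, and the group counts are made up for illustration and do not come from the KGSL source.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins for adreno_gpudev / adreno_gpu_core. */
struct gpu_core;

struct gpu_ops {
    int (*probe)(const struct gpu_core *core);
};

struct gpu_core {
    unsigned int core, major;      /* identifies the GPU core */
    const struct gpu_ops *ops;     /* core-specific function pointers */
    unsigned int num_perf_groups;  /* stand-in for the perfcounters table */
};

static int a6xx_like_probe(const struct gpu_core *core)
{
    /* A real probe would set up the hardware; this one only reports how
     * many perf counter groups the core description advertises. */
    return (int)core->num_perf_groups;
}

static const struct gpu_ops a6xx_like_ops = { .probe = a6xx_like_probe };

/* Made-up table in the spirit of a per-core list; the numbers are invented. */
static const struct gpu_core core_list[] = {
    { .core = 6, .major = 3, .ops = &a6xx_like_ops, .num_perf_groups = 4 },
    { .core = 6, .major = 6, .ops = &a6xx_like_ops, .num_perf_groups = 5 },
};

static const struct gpu_core *find_core(unsigned int core, unsigned int major)
{
    for (size_t i = 0; i < sizeof core_list / sizeof core_list[0]; i++)
        if (core_list[i].core == core && core_list[i].major == major)
            return &core_list[i];
    return NULL;
}

/* Same shape as adreno_bind(): look up the core, then dispatch to its probe. */
static int bind_device(unsigned int core, unsigned int major)
{
    const struct gpu_core *gpucore = find_core(core, major);
    if (!gpucore)
        return -1;
    assert(gpucore->ops && gpucore->ops->probe);
    return gpucore->ops->probe(gpucore);
}
```

The useful property of this layout is that the generic code never needs to know which GPU generation it is talking to; it only follows the pointers stored in the matched table entry.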
Looking at where adreno_gpu_core is referenced, we can find the full list of Adreno GPU core definitions in the drivers/gpu/msm/adreno-gpulist.h file. This is basically the main file containing pointers to various GPU core facts. From it we can see, for example, that for the A6XX GPU series, the perf counters are defined in the adreno_a6xx_perfcounters variable, in the drivers/gpu/msm/adreno_a6xx_perfcounter.c file. There we can find all the perf counter groups.
(It might seem that we are going through a rather convoluted approach here to discover this, as we might be able to directly find such information by trying to find source files with keywords related to perf counters. But I generally feel the above is better as it is more principled and can be used to discover whatever you’d like to know. The same holds for the following analysis.)
Ioctl interface
Okay, now we know there are quite a few perf counter groups. But still, how do we query them from the kernel? That comes down to the ioctl system call. If we pick up where we left off, the adreno_bind function calls the GPU core specific probe function. If we look at a concrete one, e.g., the a6xx_probe function that is registered to the adreno_a6xx_gpudev struct, it calls the a6xx_probe_common function, which in turn calls the adreno_device_probe function, which in turn calls the adreno_setup_device function. adreno_setup_device references an adreno_functable struct. There, we have a bunch of function pointers, including the one for ioctl: the adreno_ioctl function. It actually only handles a few ioctl commands, all listed in the adreno_ioctl_funcs struct and all related to perf counters:
static struct kgsl_ioctl adreno_ioctl_funcs[] = {
    { IOCTL_KGSL_PERFCOUNTER_GET, adreno_ioctl_perfcounter_get },
    { IOCTL_KGSL_PERFCOUNTER_PUT, adreno_ioctl_perfcounter_put },
    { IOCTL_KGSL_PERFCOUNTER_QUERY, adreno_ioctl_perfcounter_query },
    { IOCTL_KGSL_PERFCOUNTER_READ, adreno_ioctl_perfcounter_read },
    { IOCTL_KGSL_PREEMPTIONCOUNTER_QUERY, adreno_ioctl_preemption_counters_query },
};
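A table like this is typically consumed by a small dispatcher that looks up the command number and calls the matching handler. Below is a minimal userspace sketch of that idea. The command numbers, handler bodies, and names are all invented for illustration; the real driver builds its command numbers with the ioctl macros in the uapi header and does considerably more work around argument copying.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*ioctl_handler)(void *arg);

struct ioctl_entry {
    unsigned int cmd;
    ioctl_handler func;
};

/* Made-up command numbers, purely for this sketch. */
enum { CMD_COUNTER_GET = 1, CMD_COUNTER_PUT = 2, CMD_COUNTER_READ = 3 };

/* Toy handlers: the argument is an int tracking how many counters are active. */
static long handle_get(void *arg)  { *(int *)arg += 1; return 0; }
static long handle_put(void *arg)  { *(int *)arg -= 1; return 0; }
static long handle_read(void *arg) { return *(int *)arg; }

static const struct ioctl_entry ioctl_table[] = {
    { CMD_COUNTER_GET,  handle_get  },
    { CMD_COUNTER_PUT,  handle_put  },
    { CMD_COUNTER_READ, handle_read },
};

/* Linear search over the table; returns -1 for unknown commands. */
static long dispatch_ioctl(unsigned int cmd, void *arg)
{
    assert(arg);
    for (size_t i = 0; i < sizeof ioctl_table / sizeof ioctl_table[0]; i++)
        if (ioctl_table[i].cmd == cmd)
            return ioctl_table[i].func(arg);
    return -1;
}
```

Adding a new command is then a matter of adding one row to the table, which is why grepping for the table is such a quick way to see everything a driver exposes.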
Searching for the command symbols, we find they are all defined in the include/uapi/linux/msm_kgsl.h header. uapi means APIs for userspace here, so that matches. After reading the related ioctl struct comments, it’s relatively clear that we need to issue
- IOCTL_KGSL_PERFCOUNTER_GET for activating the perf counters we want. There is a limit on how many counters we can enable per group. (The limit is reflected by how many adreno_perfcount_register entries we have per group. They can be found in, for example, the adreno_a6xx_perfcounter.c file.)
- IOCTL_KGSL_PERFCOUNTER_PUT for deactivating perf counters once we are done.
- IOCTL_KGSL_PERFCOUNTER_READ for sampling perf counters.
And the full list of perf counter groups is also defined in the same header file.
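The GET / READ / PUT lifecycle and the per-group limit can be illustrated with a toy model that runs entirely in userspace. This is not the driver's logic and it issues no real ioctls: the fake_gpu type, the group and slot counts, and all function names are assumptions made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_GROUPS      2  /* made-up numbers; real groups and limits differ */
#define SLOTS_PER_GROUP 2

struct counter_slot {
    int in_use;
    unsigned int countable; /* which event this register is counting */
    uint64_t value;
};

struct fake_gpu {
    struct counter_slot slots[NUM_GROUPS][SLOTS_PER_GROUP];
};

static void fake_gpu_init(struct fake_gpu *gpu)
{
    assert(gpu);
    memset(gpu, 0, sizeof *gpu);
}

static struct counter_slot *find_slot(struct fake_gpu *gpu, unsigned int group,
                                      unsigned int countable)
{
    if (group >= NUM_GROUPS)
        return NULL;
    for (int i = 0; i < SLOTS_PER_GROUP; i++) {
        struct counter_slot *s = &gpu->slots[group][i];
        if (s->in_use && s->countable == countable)
            return s;
    }
    return NULL;
}

/* "GET": reserve a register in the group; fails once the group is full. */
static int counter_get(struct fake_gpu *gpu, unsigned int group,
                       unsigned int countable)
{
    if (group >= NUM_GROUPS)
        return -1;
    if (find_slot(gpu, group, countable))
        return 0; /* already active */
    for (int i = 0; i < SLOTS_PER_GROUP; i++) {
        struct counter_slot *s = &gpu->slots[group][i];
        if (!s->in_use) {
            s->in_use = 1;
            s->countable = countable;
            s->value = 0;
            return 0;
        }
    }
    return -1;
}

/* "READ": sample the current value of an active counter. */
static int counter_read(struct fake_gpu *gpu, unsigned int group,
                        unsigned int countable, uint64_t *value)
{
    struct counter_slot *s = find_slot(gpu, group, countable);
    if (!s)
        return -1;
    *value = s->value;
    return 0;
}

/* "PUT": release the register so another countable can use it. */
static int counter_put(struct fake_gpu *gpu, unsigned int group,
                       unsigned int countable)
{
    struct counter_slot *s = find_slot(gpu, group, countable);
    if (!s)
        return -1;
    s->in_use = 0;
    return 0;
}

/* Pretend the hardware ran for a while: bump every active counter. */
static void fake_gpu_tick(struct fake_gpu *gpu, uint64_t delta)
{
    for (int g = 0; g < NUM_GROUPS; g++)
        for (int i = 0; i < SLOTS_PER_GROUP; i++)
            if (gpu->slots[g][i].in_use)
                gpu->slots[g][i].value += delta;
}
```

With two slots per group, a third counter_get on the same group fails until one of the earlier counters is released with counter_put, which mirrors the limit described above.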
Perf counters
I hope the above is interesting. Thus far we know the ioctl commands to use for interacting with the kernel driver and we know there are quite a few perf counter groups. But we still don’t know what those exact counters are! That’s where the open source Freedreno driver comes in as super helpful. In its envytools subproject, we can directly find all the perf counters in an XML database. For example, for the A6XX series, it would be the registers/adreno/a6xx.xml file. It contains enums whose names end with perfcounter_select and that’s what we want.
Up to this point, we basically have all the information we need. Now we can put everything together as a proof of concept. I created a Gist for it to sample a hardcoded list of counters for 100 iterations. Everything seems fine.
Mali GPU
Thanks to the HWCPipe project, we already know how to sample perf counters from the kernel directly. But the above methodology and steps still apply. (And to truly understand the interaction, it’s inevitable to read the kernel code.) I’ll just point out some key points regarding the Mali driver code here.
Kernel API versions
Compared to Adreno GPUs, the driver code for Mali GPUs is actually much more complex: you can find multiple copies of the driver at different versions; and for the same copy, it’s using a versioned API! This actually makes sense. Unlike Adreno, which is only used by Qualcomm, Mali GPUs are licensed to various SoC vendors as IP blocks. Different vendors have different needs that ARM has to serve, which requires the kernel code to be structured this way.
But it does mean more steps to interact with the kernel driver. We need to additionally negotiate the API version and set up the API context.
GPU characteristics
Figuring out the exact GPU characteristics is also harder, as Mali GPUs are configurable (again to satisfy different SoC vendors' needs). There can be a varying number of cores or L2 cache slices, and all of that needs to be factored in to properly calculate the final perf counter value. To make things even more obscure, unlike Adreno GPUs, where the GPU ID resembles the marketing product name (Adreno 540/650/etc.), the GPU ID reported by the Mali kernel driver has nothing to do with the marketing name (Mali G57/G78/etc.). These GPU properties are all packed as key-value pairs and returned as a flat buffer when using ioctl to query them. To show how it’s done, here is a Gist file dumping some interesting properties of Mali GPUs.
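As a rough illustration of walking such a flat key-value buffer, here is a small decoder. The encoding it assumes is my reading of the driver and should be treated as an assumption rather than a specification: each entry is a 32-bit little-endian key of the form (property id << 2) | size code, followed by a little-endian value of 1, 2, 4, or 8 bytes depending on the size code. The function names are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read `n` little-endian bytes (n <= 8) starting at p. */
static uint64_t read_le(const uint8_t *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Look up property `wanted_id` in a flat key/value buffer.
 * Assumed encoding: key = (id << 2) | size_code, where size_code 0..3 means
 * the value that follows is 1, 2, 4, or 8 bytes wide.
 * Returns 0 and stores the value on success, -1 if the property is missing
 * or the buffer is truncated. */
static int find_gpu_prop(const uint8_t *buf, size_t size, uint32_t wanted_id,
                         uint64_t *out)
{
    assert(buf && out);

    size_t pos = 0;
    while (size - pos >= 4) {
        uint32_t key = (uint32_t)read_le(buf + pos, 4);
        size_t value_size = (size_t)1 << (key & 3u);
        pos += 4;
        if (size - pos < value_size)
            return -1; /* truncated value */
        if ((key >> 2) == wanted_id) {
            *out = read_le(buf + pos, value_size);
            return 0;
        }
        pos += value_size;
    }
    return -1;
}
```

Because every entry carries its own size, the decoder can skip over properties it does not know about, which is what lets the same buffer format serve many different GPU configurations.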
Perf counters
Perf counters in the Mali kernel driver go through a separate API entry point. Unlike Adreno, where we use the main device file descriptor to handle perf counters, for Mali GPUs we need to request another dedicated file descriptor from the driver for perf counters. The drivers/gpu/arm/bv_r26p0/mali_kbase_ioctl.h file contains top-level ioctl commands, including the ioctl command for setting up the perf counter reader. The ioctl commands for perf counters are in the drivers/gpu/arm/bv_r26p0/mali_kbase_hwcnt_reader.h file. The ioctl entry point function is kbasep_vinstr_hwcnt_reader_ioctl.
In a Mali GPU, there are four functionality blocks (job manager, tiler, shader core, memory) that can emit perf counters. Each functionality block always returns a fixed-size block containing 64 counters. All the perf counters are packed into a contiguous buffer, whose layout is detailed here. But the exact meaning of each counter varies per device. What’s nice, though, is that ARM publishes very detailed explanations of their perf counters. So no need for guesswork.
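To show how the fixed 64-counter blocks and the variable core count interact, here is a sketch that sums one shader core counter across all cores. It rests on assumptions that need checking against the actual layout for a given device: counters are taken to be 32-bit, and blocks are taken to be ordered as job manager, tiler, one memory block per L2 slice, then one block per shader core with no gaps. The struct and function names are invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COUNTERS_PER_BLOCK 64

/* Assumed block order in the sample buffer:
 * [job manager][tiler][memory x num_l2_slices][shader core x num_cores] */
struct mali_layout {
    unsigned int num_l2_slices;
    unsigned int num_cores;
};

/* Index of the block that belongs to shader core `core`. */
static size_t shader_block_index(const struct mali_layout *layout,
                                 unsigned int core)
{
    return 2u + layout->num_l2_slices + core;
}

/* Sum counter number `counter` over every shader core block. */
static uint64_t sum_shader_counter(const uint32_t *buf,
                                   const struct mali_layout *layout,
                                   unsigned int counter)
{
    assert(buf && layout && counter < COUNTERS_PER_BLOCK);

    uint64_t total = 0;
    for (unsigned int core = 0; core < layout->num_cores; core++) {
        size_t block = shader_block_index(layout, core);
        total += buf[block * COUNTERS_PER_BLOCK + counter];
    }
    return total;
}
```

A real implementation would also need to handle configurations where the shader core numbering has holes, which this sketch ignores.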
Closing Remarks
That’s it. Thanks for reading through! Hopefully this provides useful information for you. It just happens that I need to understand perf counters so I’m using them as examples here. But really, this blog post is more to show that by inspecting the kernel code we can gain a lot of insights into those mobile GPUs.
[1] Note that there is also a drivers/gpu/drm/msm directory; but that’s for the open source drivers.
[2] In case you are curious, of stands for “Open Firmware”. The Open Firmware Project defines the device tree.
[3] “Lahaina” is Snapdragon 888’s codename.