Tetragon’s TracingPolicy is a user-configurable Kubernetes custom resource (CR) that
allows users to trace arbitrary events in the kernel and optionally define
actions to take on a match. Policies consist of a hook point (kprobes,
tracepoints, and uprobes are supported), and selectors for in-kernel filtering
and specifying actions. For more details, see the hook points page and the
selectors page.
Caution
TracingPolicy allows for powerful, yet low-level configuration and, as such,
requires knowledge about the Linux kernel and containers to avoid unexpected
issues such as TOCTOU bugs.
For the complete custom resource definition (CRD) refer to the YAML file
cilium.io_tracingpolicies.yaml.
One practical way to explore the CRD is to run kubectl explain against a
Kubernetes API server on which it is installed. For example, kubectl explain tracingpolicy.spec.kprobes provides
field-specific documentation and details on the kprobe spec.
Tracing Policies can be loaded and unloaded at runtime in Tetragon, or on
startup using flags.
With Kubernetes, you can use kubectl to add and remove a TracingPolicy.
You can use the tetra gRPC CLI to add and remove a TracingPolicy.
You can use the --tracing-policy and --tracing-policy-dir flags to statically add policies at
startup time; see more in the daemon configuration page.
Hence, even though Tracing Policies are structured as a Kubernetes CR, they can also be used in
non-Kubernetes environments using the last two loading methods.
1 - Example
Learn the basics of Tracing Policy via an example
To discover TracingPolicy, let’s walk through an example policy that is
explained, part by part, in this document.
Warning This policy is for illustration purposes only and should not be used to
restrict access to certain files. It can be easily bypassed by, for example,
using hard links.
The policy checks for file descriptors being created, and sends a SIGKILL signal to any process that
creates a file descriptor to a file named /tmp/tetragon. We discuss the policy in more detail
next.
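A policy matching this description might look like the following sketch. It is assembled from the parts discussed in the rest of this document (the fd_install kprobe, its two arguments, the Equal selector and the Sigkill action); the metadata name is an arbitrary placeholder.

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "example"            # arbitrary placeholder name
spec:
  kprobes:
  - call: "fd_install"       # called when a file descriptor is installed
    syscall: false           # a regular kernel function, not a syscall
    args:
    - index: 0
      type: "int"            # the file descriptor number
    - index: 1
      type: "file"           # the kernel struct file
    selectors:
    - matchArgs:
      - index: 1
        operator: "Equal"
        values:
        - "/tmp/tetragon"
      matchActions:
      - action: Sigkill      # kill the process that triggered the match
```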
The first part follows a common pattern among all Cilium Policies or, more
widely, Kubernetes objects.
It first declares the Kubernetes API used, then the kind of Kubernetes object
it is in this API and an arbitrary name for the object that has to comply with
Kubernetes naming convention.
The beginning of the specification describes the hook point to use. Here we are
using a kprobe, hooking on the kernel function fd_install. That’s the kernel
function that gets called when a new file descriptor is created. We
indicate that it’s not a syscall, but a regular kernel function. We then
specify the function arguments, so that Tetragon’s BPF code will extract
and optionally perform filtering on them.
See the hook points page
for further information on the various hook points available and arguments.
Selectors allow you to filter on the events to extract only a subset of the
events based on different properties and optionally take an action.
In the example, we filter on the argument at index 1, passing a file struct
to the function. Tetragon has the knowledge on how to apply the Equal
operator over a Linux kernel file struct and match on the
path of the file.
Then we add the Sigkill action, meaning that any match of the selector
should send a SIGKILL signal to the process that initiated the event.
Learn more about the various selectors in the dedicated
selectors page.
Message
The message field is an optional short message that will be included in
the generated event to inform users what is happening.
spec:
  kprobes:
  - call: "fd_install"
    message: "Installing a file descriptor"
Tags
Tags are optional fields of a Tracing Policy that are used to categorize generated
events. Further reference here: Tags documentation.
Policy effect
First, let’s create the /tmp/tetragon file with some content:
echo eBPF! > /tmp/tetragon
You can save the policy in an example.yaml file, compile Tetragon locally, and start Tetragon:
Note Stop tetragon with Ctrl+C to disable the policy and
remove the BPF programs.
Once Tetragon starts, you can monitor events using tetra, the Tetragon CLI:
./tetra getevents -o compact
Reading the /tmp/tetragon file with cat:
cat /tmp/tetragon
Results in the following events:
🚀 process /usr/bin/cat /tmp/tetragon
📬 open /usr/bin/cat /tmp/tetragon
💥 exit /usr/bin/cat /tmp/tetragon SIGKILL
And the shell where the cat command was performed will return:
Killed
See more
For more examples of tracing policies, take a look at the
examples/tracingpolicy
folder in the Tetragon repository. Also read the following sections on
hook points and
selectors.
2 - Hook points
Hook points for Tracing Policies and arguments description
Tetragon can hook into the kernel using kprobes and tracepoints, as well as in user-space
programs using uprobes. Users can configure these hook points using the corresponding sections of
the TracingPolicy specification (.spec). These hook points include arguments and return values
that can be specified using the args and returnArg fields as detailed in the following sections.
Kprobes
Kprobes enable you to dynamically hook into any kernel function and execute BPF code. Because
kernel functions might change across versions, kprobes are highly tied to your kernel version and,
thus, might not be portable across different kernels.
Conveniently, you can list all kernel symbols by reading the /proc/kallsyms
file. For example, to search for the write syscall kernel function, you can
execute sudo grep sys_write /proc/kallsyms. The output should be similar to
this, apart from the architecture-specific prefixes.
ffffdeb14ea712e0 T __arm64_sys_writev
ffffdeb14ea73010 T ksys_write
ffffdeb14ea73140 T __arm64_sys_write
ffffdeb14eb5a460 t proc_sys_write
ffffdeb15092a700 d _eil_addr___arm64_sys_writev
ffffdeb15092a740 d _eil_addr___arm64_sys_write
You can see that the exact name of the symbol for the write syscall on our
kernel version is __arm64_sys_write. Note that on x86_64, the prefix would
be __x64_ instead of __arm64_.
Caution
Kernel symbols contain an architecture specific prefix when they refer to
syscall symbols. To write portable tracing policies, i.e. policies that can run
on multiple architectures, just use the symbol name without the prefix.
For example, instead of writing call: "__arm64_sys_write" or call: "__x64_sys_write", just write call: "sys_write". Tetragon will adapt and add
the correct prefix based on the architecture of the underlying machine. Note
that the event generated as output currently includes the prefix.
In our example, we will explore a kprobe hooking into the
fd_install
kernel function. The fd_install kernel function is called each time a file
descriptor is installed into the file descriptor table of a process, typically
referenced within system calls like open or openat. Hooking fd_install
has its benefits and limitations, which are out of the scope of this guide.
spec:
  kprobes:
  - call: "fd_install"
    syscall: false
Note The syscall field, specific to a kprobe spec and false by default,
indicates whether Tetragon will hook a syscall or just a regular
kernel function. Tetragon needs this information because syscalls and kernel
functions use different ABIs.
Kprobe calls can be defined independently in different policies,
or together in the same policy. For example, multiple kprobes can be
defined under the same tracing policy.
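As a minimal sketch of two kprobes in one policy, the fragment below hooks sys_read and sys_write together. The choice of these two syscalls is only an illustration, and the args and selectors are left out on purpose.

```yaml
spec:
  kprobes:
  - call: "sys_read"
    syscall: true
    # args and selectors for sys_read go here
  - call: "sys_write"
    syscall: true
    # args and selectors for sys_write go here
```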
Tracepoints
Tracepoints are statically defined in the kernel and have the advantage of being stable across
kernel versions and thus more portable than kprobes.
To see the list of tracepoints available on your kernel, run
sudo ls /sys/kernel/debug/tracing/events. The output contains one directory
per tracing subsystem.
You can then choose the subsystem that you want to trace, and look up the
tracepoint you want to use and its format. For example, if we choose the
netif_receive_skb tracepoint from the net subsystem, we can read its
format with sudo cat /sys/kernel/debug/tracing/events/net/netif_receive_skb/format.
The output describes the fields that the tracepoint exposes.
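A tracepoint hook for the netif_receive_skb example might be declared as in the sketch below. It assumes that the tracepoints section identifies a tracepoint by its subsystem and event names; check the CRD for the arguments you can extract.

```yaml
spec:
  tracepoints:
  - subsystem: "net"              # directory under tracing/events
    event: "netif_receive_skb"    # tracepoint name inside that subsystem
```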
Uprobes
Uprobes are similar to kprobes, but they allow you to dynamically hook into any
user-space function and execute BPF code. Uprobes are also tied to the binary
version of the user-space program, so they may not be portable across different
versions or architectures.
To use uprobes, you need to specify the path to the executable or library file,
and the symbol of the function you want to probe. You can use tools like
objdump, nm, or readelf to find the symbol of a function in a binary
file. For example, to find the readline symbol in /bin/bash using nm, you
can run:
nm -D /bin/bash | grep readline
The output should look similar to this, with a few lines redacted:
[...]
000000000009f2b0 T pcomp_set_readline_variables
0000000000097e40 T posix_readline_initialize
00000000000d5690 T readline
00000000000d52f0 T readline_internal_char
00000000000d42d0 T readline_internal_setup
[...]
Each line of the nm output shows first the symbol value, then the symbol type,
and finally the symbol name. For the readline symbol, the type T means that
this symbol is in the text (code) section of the binary. This confirms that the
readline symbol is present in the /bin/bash binary and might be a function
name that we can hook with a uprobe.
You can define multiple uprobes in the same policy, or in different policies.
You can also combine uprobes with kprobes and tracepoints to get a
comprehensive view of the system behavior.
For example, a policy can define a uprobe for the readline
function in the bash executable and apply it to all processes that use the
bash binary. Such a policy hooks into the readline function
running in all the bash shells.
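A uprobe policy of this kind could be sketched as follows. The policy name is a placeholder, and the fragment assumes that a uprobe is described by the path of the binary and a list of symbols, as explained above.

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "example-uprobe"   # arbitrary placeholder name
spec:
  uprobes:
  - path: "/bin/bash"      # executable or library containing the symbol
    symbols:
    - "readline"           # symbol found with nm in the previous step
```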
LSM BPF
LSM BPF programs allow runtime instrumentation of the LSM hooks by privileged
users to implement system-wide MAC (Mandatory Access Control) and Audit policies
using eBPF.
The list of LSM hooks that can be instrumented can be found in security/security.c.
To verify that BPF LSM is available, use the following command:
cat /boot/config-$(uname -r) | grep BPF_LSM
The output should be similar to this if BPF LSM is supported:
CONFIG_BPF_LSM=y
Then, if the above condition is met, use this command to check whether BPF LSM is enabled:
cat /sys/kernel/security/lsm
The output might look like this:
bpf,lockdown,integrity,apparmor
If the output includes bpf, then BPF LSM is enabled. Otherwise, you can enable it by modifying the kernel command line in /etc/default/grub.
Arguments
Kprobes, uprobes and tracepoints all share an arguments field called args. It is a list of
arguments to include in the trace output. Tetragon’s BPF code requires
information about the types of arguments to properly read, print and
filter on them. This information needs to be provided by the user under the
args section. For the available types, check the TracingPolicy CRD.
Following our example, here is the part that defines the arguments:
args:
- index: 0
  type: "int"
- index: 1
  type: "file"
Each argument can optionally include a label parameter, which will be included
in the output. This can be used to annotate the arguments to help with understanding
and processing the output. As an example, the same definition can carry an
appropriate label on the int argument.
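A labeled variant of the argument definition might look like this sketch; the label text "FD" is an arbitrary choice for illustration.

```yaml
args:
- index: 0
  type: "int"
  label: "FD"      # free-form annotation shown next to the value in the output
- index: 1
  type: "file"
```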
To properly read and hook onto the fd_install(unsigned int fd, struct file *file) function, the YAML snippet above tells the BPF code that the first
argument is an int and the second argument is a file, which is the struct file
of the kernel. In this way, the BPF code and its printer can properly collect
and print the arguments.
These types are sorted by the index field, where you can specify the order.
The indexing starts with 0. So, index: 0 means, this is going to be the first
argument of the function, index: 1 means this is going to be the second
argument of the function, etc.
Note that for some args types, char_buf and char_iovec, there are
additional fields named returnCopy and sizeArgIndex available:
returnCopy indicates that the corresponding argument should be read later (when
the kretprobe for the symbol is triggered) because it might not be populated
when the kprobe is triggered at the entrance of the function. For example, a
buffer supplied to read(2) won’t have content until kretprobe is triggered.
sizeArgIndex indicates the (1-based, see warning below) index of the argument
that represents the size of the char_buf or char_iovec. For example, for
write(2), the third argument, size_t count, is the number of char
elements that we can read from the const void *buf pointer in the second
argument. Similarly, if we would like to capture the __x64_sys_writev(long, iovec *, vlen) syscall, then iovec has a size of vlen, which is going to
be the third argument.
Caution
sizeArgIndex is inconsistent at the moment and does not take the index, but
the number of the index (or index + 1). So if the size is the third argument,
index 2, the value should be 3.
These flags can be combined on the same argument.
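As a sketch of combining both flags, the fragment below reads the buffer of read(2) on return. Using sys_read here is an illustrative choice based on the read(2) remark above, not a policy taken from the original text.

```yaml
- call: "sys_read"
  syscall: true
  args:
  - index: 0
    type: "int"
  - index: 1
    type: "char_buf"
    returnCopy: true   # read the buffer when the kretprobe fires
    sizeArgIndex: 3    # size is the third argument (index 2), so the value is 3
  - index: 2
    type: "size_t"
```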
Note that you can specify which arguments you would like to print from a
specific syscall. For example, if you don’t care about the file descriptor,
which is the first int argument with index: 0, and just want the char_buf,
that is, what is written, then you can leave that argument out of the args section.
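A definition without the file descriptor might look like the sketch below. Keeping the size_t argument is an assumption here, made so that the size referenced by sizeArgIndex is still declared.

```yaml
- call: "sys_write"
  syscall: true
  args:
  - index: 1
    type: "char_buf"
    sizeArgIndex: 3   # 1-based position of the size argument
  - index: 2
    type: "size_t"
```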
The maxData field is only used for char_buf data. When this value is false (default),
the BPF program will fetch at most 4096 bytes. On later kernels (>=5.4), Tetragon
supports fetching up to 327360 bytes if this flag is turned on.
The maxData flag does not work with returnCopy flag at the moment, so it’s
usable only for syscalls/functions that do not require return probe to read the
data.
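A sketch of an argument with maxData enabled is shown below. It assumes maxData is set per argument, next to sizeArgIndex, and it hooks sys_write because that call does not need a return probe to read the buffer.

```yaml
- call: "sys_write"
  syscall: true
  args:
  - index: 1
    type: "char_buf"
    maxData: true     # lift the default 4096-byte limit on supported kernels
    sizeArgIndex: 3
  - index: 2
    type: "size_t"
```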
Return values
A TracingPolicy spec can specify that the return value should be reported in
the tracing output. To do this, the return parameter of the call needs to be
set to true, and the returnArg parameter needs to be set to specify the
type of the return argument.
For example, the sk_alloc hook can be specified to return a value of type sock
(a pointer to a struct sock). Whenever the sk_alloc hook is hit, not only
will it report the family parameter in index 1, it will also report the socket
that was created.
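The sk_alloc case described above can be sketched as follows. The label is an illustrative addition, and index: 0 under returnArg is an assumption about how the return value is addressed.

```yaml
kprobes:
- call: "sk_alloc"
  syscall: false
  return: true          # report the return value
  args:
  - index: 1
    type: "int"
    label: "family"
  returnArg:
    index: 0
    type: "sock"        # pointer to a struct sock
```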
Return values for socket tracking
A unique feature of a sock being returned from a hook such as sk_alloc is that
the socket it refers to can be tracked. Most networking hooks in the network stack
are run in a context that is not that of the process that owns the socket for which
the actions relate; this is because networking happens asynchronously and not
entirely in-line with the process. The sk_alloc hook does, however, occur in the
context of the process, such that the task, the PID, and the TGID are those of the process
that requested the creation of the socket.
Specifying socket tracking tells Tetragon to store a mapping between the socket
and the process’ PID and TGID; and to use that mapping when it sees the socket in a
sock argument in another hook to replace the PID and TGID of the context with the
process that actually owns the socket. This can be done by adding a returnArgAction
to the call. Available actions are TrackSock and UntrackSock.
See TrackSock and UntrackSock.
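As a sketch, socket tracking could be enabled on the sk_alloc hook by adding returnArgAction next to returnArg. The rest of the call definition is trimmed for brevity.

```yaml
kprobes:
- call: "sk_alloc"
  syscall: false
  return: true
  returnArg:
    index: 0
    type: "sock"
  returnArgAction: TrackSock   # remember which process owns the returned socket
```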
Lists
When a kprobe refers to a list of functions, the kprobe definition creates a kprobe for each item in the list and shares the rest
of the config specified for the kprobe.
A list can also specify a type field that implies extra checks on the values (like for the syscall type)
or denotes that the list is generated automatically (see below).
The user must specify the syscall type for a list of syscall functions. Also, syscall functions
can’t be mixed with regular functions in the list.
The additional selector configuration is shared with all functions in the list.
For example, a list of three functions creates 3 kprobes that share the same PID filter.
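A list shared by three kprobes might be written as in this sketch. The list name, the dup syscalls and the PID value are placeholders; the "list:" prefix in call and the "syscalls" type value are assumptions about the CRD syntax to verify against your release.

```yaml
spec:
  lists:
  - name: "dups"              # placeholder list name
    type: "syscalls"          # all values are syscall functions
    values:
    - "sys_dup"
    - "sys_dup2"
    - "sys_dup3"
  kprobes:
  - call: "list:dups"         # one kprobe is created per list item
    syscall: true
    selectors:
    - matchPIDs:
      - operator: In
        followForks: true
        isNamespacePID: false
        values:
        - 12345               # placeholder PID
```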
The generated_ftrace type of list generates functions from the ftrace available_filter_functions
file with a specified filter. The filter is specified with the pattern field and expects a regular expression.
For example, such a list can trace all kernel ksys_* functions for the /usr/bin/kill binary.
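A generated list for the ksys_* functions could be sketched like this. The list name is a placeholder, and the exact regular expression and the "list:" reference syntax are assumptions.

```yaml
spec:
  lists:
  - name: "ksys"                  # placeholder list name
    type: "generated_ftrace"
    pattern: "^ksys_*"            # regular expression over available_filter_functions
  kprobes:
  - call: "list:ksys"
    selectors:
    - matchBinaries:
      - operator: "In"
        values:
        - "/usr/bin/kill"
```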
3 - Options
The options array is passed to and processed by each hook used in the spec file that
supports options. At the moment it’s available for kprobe and uprobe hooks.
The disable-kprobe-multi option disables the kprobe multi link interface for all the kprobes defined in
the spec file. If enabled, all the defined kprobes will be attached through the standard
kprobe interface. The multi link interface stays enabled for other spec files that do not set this option.
The disable-uprobe-multi option disables the uprobe multi link interface for all the uprobes defined in
the spec file. If enabled, all the defined uprobes will be attached through the standard
uprobe interface. The multi link interface stays enabled for other spec files that do not set this option.
Each option takes a boolean value, which is false by default.
Example:
options:
- name: "disable-uprobe-multi"
  value: "1"
4 - Selectors
Perform in-kernel BPF filtering and actions on events
Selectors are a way to perform in-kernel BPF filtering on the events to
export, or on the events on which to apply an action.
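A TracingPolicy can contain from 0 to 5 selectors. A selector is composed of
1 or more filters. The available filters described in this document are the following:
Arguments
Return args
PIDs
Binaries
Namespaces
Capabilities
Namespace changes
Capability changes
Actions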
Arguments filter
Arguments filters can be specified under the matchArgs field and provide
filtering based on the value of the function’s argument.
For example, a selector can define a matchArgs filter that tells
the BPF code to process only the function calls for which the second argument,
index equal to 1, concerns the file under the path /etc/passwd or
/etc/shadow, using the operator Equal to match against the value of
the argument.
Note that conveniently, we can match against a path directly when the argument
is of type file.
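A matchArgs filter for these two paths might look like the following sketch, assuming the argument at index 1 was declared with type file.

```yaml
selectors:
- matchArgs:
  - index: 1
    operator: "Equal"
    values:
    - "/etc/passwd"
    - "/etc/shadow"
```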
With the Prefix operator and the value /etc instead, an event will be created every time a process tries to
access a file under /etc.
Although it makes less sense, you can also match over the first argument, to
only detect events that use the file descriptor 4, which is typically one of the
first descriptors allocated after stdin, stdout and stderr in a process, and combine that
with the previous example.
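Combining a Prefix match on the path with an Equal match on the file descriptor could be sketched as below. That both entries must match for the selector to fire is an assumption about how entries under one matchArgs are combined.

```yaml
selectors:
- matchArgs:
  - index: 0
    operator: "Equal"
    values:
    - "4"          # file descriptor number
  - index: 1
    operator: "Prefix"
    values:
    - "/etc"       # any file under /etc
```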
Return args filter
Return args filters can be specified under the matchReturnArgs field and
provide filtering based on the function’s return value. They allow
you to filter on the success, error or value returned by a
kernel call.
matchReturnArgs:
- operator: "NotEqual"
  values:
  - 0
The available operators for matchReturnArgs are:
Equal
NotEqual
Prefix
Postfix
A use case for this would be to detect failed access to certain files, like
/etc/shadow. Running cat /etc/shadow will use an openat syscall that
returns -1 for a failed attempt with an unprivileged user.
PIDs filter
PIDs filters can be specified under the matchPIDs field and provide filtering
based on the value of the host PID of the process. For example, a
matchPIDs filter can tell the BPF code to observe only hooks for which the
host PID is equal to either pid1 or pid2 or pid3.
Another example can be to collect all processes not associated with a
container’s init PID, which is equal to 1. In this way, we are able to detect
if there was a kubectl exec performed inside a container because processes
created by kubectl exec are not children of PID 1.
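The two cases above might be expressed as in the fragments below, shown as separate YAML documents. The numeric PIDs stand in for pid1, pid2 and pid3, and isNamespacePID is assumed to switch matching from host PIDs to PIDs inside the container's PID namespace.

```yaml
# Fragment 1: observe only three specific host PIDs (placeholder values).
matchPIDs:
- operator: In
  followForks: true
  values:
  - 1001
  - 1002
  - 1003
---
# Fragment 2: observe everything that is not the container's init process.
matchPIDs:
- operator: NotIn
  followForks: false
  isNamespacePID: true
  values:
  - 1
```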
Binary filter
Binary filters can be specified under the matchBinaries field and provide
filtering based on the value of a certain binary name. For example, a
matchBinaries selector can tell the BPF code to process only system
calls and kernel functions that are coming from cat or tail.
The values field has to be a list of strings. The default behaviour
is followForks: true, so all the child processes are followed.
The current limitation is 4 values.
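A matchBinaries selector for cat and tail could be sketched like this; the absolute paths are assumptions about where the binaries live on the target system.

```yaml
selectors:
- matchBinaries:
  - operator: "In"
    values:
    - "/usr/bin/cat"
    - "/usr/bin/tail"
```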
Follow children
The matchBinaries filter can be configured to also apply to children of matching processes. To do
this, set followChildren to true.
There are a number of limitations when using followChildren:
Children created before the policy was installed will not be matched
The number of matchBinaries sections with followChildren: true cannot exceed 64.
Operators other than In are not supported.
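As a sketch within these limitations, the selector below matches sshd and its children. The binary path is illustrative, and the In operator is used because other operators are not supported with followChildren.

```yaml
selectors:
- matchBinaries:
  - operator: "In"
    values:
    - "/usr/sbin/sshd"
    followChildren: true   # also match processes descending from sshd
```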
Further examples
One example can be to monitor all the sys_write system calls which are
coming from the /usr/sbin/sshd binary and its child processes and writing to
stdin/stdout/stderr.
This is how we can monitor what was written to the console by different users
during different ssh sessions. The matchBinaries selector in this case is the
following:
- call: "sys_write"
  syscall: true
  args:
  - index: 0
    type: "int"
  - index: 1
    type: "char_buf"
    sizeArgIndex: 3
  - index: 2
    type: "size_t"
  selectors:
  # match to /usr/sbin/sshd
  - matchBinaries:
    - operator: "In"
      values:
      - "/usr/sbin/sshd"
    # match to stdin/stdout/stderr
    matchArgs:
    - index: 0
      operator: "Equal"
      values:
      - "0"
      - "1"
      - "2"
Namespaces filter
Namespaces filters can be specified under the matchNamespaces field and
provide filtering of calls based on Linux namespaces. You can specify the
namespace inode or use the special host_ns keyword; see the description
below for more information.
For example, a filter on the Pid namespace with the values 4026531836 and
4026531835 will match if: [Pid namespace is 4026531836] OR [Pid namespace is
4026531835]
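Such a filter might be written as in the sketch below, using the two inode numbers from the text.

```yaml
selectors:
- matchNamespaces:
  - namespace: Pid
    operator: In
    values:
    - "4026531836"
    - "4026531835"
```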
namespace can be: Uts, Ipc, Mnt, Pid, PidForChildren, Net,
Cgroup, or User. Time and TimeForChildren are also available in Linux
>= 5.6.
operator can be In or NotIn
values can be raw numeric values (i.e. obtained from lsns) or "host_ns"
which will automatically be translated to the appropriate value.
Limitations
We can have up to 4 values. These can be both numeric and host_ns inside
a single namespace.
We can have up to 4 namespace values under matchNamespaces in Linux
kernel < 5.3. In Linux >= 5.3 we can have up to 10 values (i.e. the maximum
number of namespaces that modern kernels provide).
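When several namespaces are listed under matchNamespaces, all of them must
match. For example, a filter listing the Pid and Mnt namespaces
will match if: ([Pid namespace is 4026531836] OR [Pid namespace is
4026531835]) AND ([Mnt namespace is 4026531833] OR [Mnt namespace
is 4026531834])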
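A filter with both namespaces could look like this sketch, with the inode numbers taken from the condition above.

```yaml
selectors:
- matchNamespaces:
  - namespace: Pid        # first condition
    operator: In
    values:
    - "4026531836"
    - "4026531835"
  - namespace: Mnt        # second condition, combined with the first by AND
    operator: In
    values:
    - "4026531833"
    - "4026531834"
```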
Use case examples
Generate a kprobe event if /etc/shadow was opened by /bin/cat which
either had host Net or Mnt namespace access
Such a policy has two selectors, combined as [Selector1 OR Selector2]. Inside each selector we have filters.
Both selectors have 3 filters (i.e. matchBinaries, matchArgs, and
matchNamespaces) with different arguments. Adding a - at the beginning of a
filter will result in a new selector.
So a policy with these two selectors will match if:
[binary == /bin/cat AND arg1 == /etc/shadow AND MntNs == host] OR [binary == /bin/cat AND arg1 == /etc/shadow AND NetNs == host]
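The two-selector case can be sketched as follows. Hooking fd_install with a file argument at index 1 is an assumption carried over from the earlier example, since the hook is not named in this passage.

```yaml
kprobes:
- call: "fd_install"
  syscall: false
  args:
  - index: 0
    type: "int"
  - index: 1
    type: "file"
  selectors:
  # Selector 1: cat opening /etc/shadow from the host Mnt namespace
  - matchBinaries:
    - operator: "In"
      values:
      - "/bin/cat"
    matchArgs:
    - index: 1
      operator: "Equal"
      values:
      - "/etc/shadow"
    matchNamespaces:
    - namespace: Mnt
      operator: In
      values:
      - "host_ns"
  # Selector 2: same binary and file, but from the host Net namespace
  - matchBinaries:
    - operator: "In"
      values:
      - "/bin/cat"
    matchArgs:
    - index: 1
      operator: "Equal"
      values:
      - "/etc/shadow"
    matchNamespaces:
    - namespace: Net
      operator: In
      values:
      - "host_ns"
```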
We can modify the previous example as follows:
Generate a kprobe event if /etc/shadow was opened by /bin/cat which has
host Net and Mnt namespace access
Here we have a single selector. This policy will match if:
[binary == /bin/cat AND arg1 == /etc/shadow AND (MntNs == host AND
NetNs == host)]
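The single-selector variant might look like this sketch: both namespaces sit under one matchNamespaces, so both must match. The file argument at index 1 is again assumed from the earlier fd_install example.

```yaml
selectors:
- matchBinaries:
  - operator: "In"
    values:
    - "/bin/cat"
  matchArgs:
  - index: 1
    operator: "Equal"
    values:
    - "/etc/shadow"
  matchNamespaces:
  - namespace: Mnt
    operator: In
    values:
    - "host_ns"
  - namespace: Net
    operator: In
    values:
    - "host_ns"
```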
Capabilities filter
Capabilities filters can be specified under the matchCapabilities field and
provide filtering of calls based on Linux capabilities in the specific sets.
For example, a filter of type Effective with the values CAP_CHOWN and
CAP_NET_RAW will match if: [Effective capabilities contain CAP_CHOWN] OR
[Effective capabilities contain CAP_NET_RAW]
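A capabilities filter for this condition could be sketched as:

```yaml
selectors:
- matchCapabilities:
  - type: Effective
    operator: In
    values:
    - "CAP_CHOWN"
    - "CAP_NET_RAW"
```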
type can be: Effective, Inheritable, or Permitted.
operator can be In or NotIn
values can be any supported capability. A list of all supported
capabilities can be found in /usr/include/linux/capability.h.
Limitations
There is no limit in the number of capabilities listed under values.
Only one type field can be specified under matchCapabilities.
Namespace changes filter
Namespace changes filters can be specified under the matchNamespaceChanges
field and provide filtering based on calls that are changing Linux namespaces.
This filter can be useful to track execution of code in a new namespace or even
container escapes that change their namespaces.
For instance, if an unprivileged process creates a new user namespace, it gains
full privileges within that namespace. This grants the process the ability to
perform some privileged operations within the context of this new namespace
that would otherwise only be available to a privileged root user. As a result, such a
filter is useful to track namespace creation, which can be abused by untrusted
processes.
To keep track of the changes, when a process_exec happens, the namespaces of
the process are recorded and these are compared with the current namespaces on
the event with a matchNamespaceChanges filter.
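A filter that reacts to a changed mount namespace might be sketched as below. The field layout (an operator plus a list of namespace names) is an assumption modeled on matchNamespaces; verify it against the CRD.

```yaml
selectors:
- matchNamespaceChanges:
  - operator: In
    values:
    - "Mnt"     # match when the Mnt namespace differs from the one recorded at exec
```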
The unshare command, or executing in the host namespace using nsenter can
be used to test this feature. See a
demonstration example
of this feature.
Capability changes filter
Capability changes filters can be specified under the matchCapabilityChanges
field and provide filtering based on calls that are changing Linux capabilities.
To keep track of the changes, when a process_exec happens, the capabilities
of the process are recorded and these are compared with the current
capabilities on the event with a matchCapabilityChanges filter.
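A capability changes filter could look like the following sketch. The fields are assumed to mirror matchCapabilities, and CAP_SETUID is only an illustrative value.

```yaml
selectors:
- matchCapabilityChanges:
  - type: Effective
    operator: In
    values:
    - "CAP_SETUID"   # match when this capability changed since exec
```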
Actions filter
Actions filters are a list of actions that execute when an appropriate selector
matches. They are defined under matchActions. The action types described in
this document are Sigkill, Override, FollowFD, UnfollowFD, CopyFD, GetUrl,
DnsLookup, Post, NoPost, TrackSock and UntrackSock.
Note
Sigkill, Override, FollowFD, UnfollowFD, CopyFD, Post,
TrackSock and UntrackSock are
executed directly in the kernel BPF code while GetUrl and DnsLookup
happen in userspace after the reception of events.
Sigkill action
Sigkill action synchronously terminates, from the kernel, the process that made the call that
matches the appropriate selectors. For example, a policy can terminate every process
with a PID other than 0 or 1 that issues a sys_write system call attempting to write to
/etc/passwd. Filtering on the PID is useful because, when using kubectl exec, a new
process is spawned in the container PID namespace and is not a child of PID 1.
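Such a policy might be sketched as follows. The policy name is a placeholder, and matching a filename on sys_write assumes the first argument is declared with type fd and that a FollowFD mapping (described later) is maintained elsewhere in the policy.

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "sigkill-example"    # arbitrary placeholder name
spec:
  kprobes:
  - call: "sys_write"
    syscall: true
    args:
    - index: 0
      type: "fd"             # resolved to a filename through the FollowFD mapping
    - index: 1
      type: "char_buf"
      sizeArgIndex: 3
    - index: 2
      type: "size_t"
    selectors:
    - matchPIDs:
      - operator: NotIn
        followForks: true
        isNamespacePID: true
        values:
        - 0
        - 1
      matchArgs:
      - index: 0
        operator: "Prefix"
        values:
        - "/etc/passwd"
      matchActions:
      - action: Sigkill
```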
Caution Please consult the Enforcement section if you plan to use
this action for enforcement.
Override action
Override action allows modifying the return value of a call. While Sigkill
will terminate the entire process responsible for making the call, Override
will run in place of the original kprobed function and return the value
specified in the argError field. It’s then up to the code path or the user
space process handling the returned value to decide whether to stop or proceed with the
execution.
For example, you can create a TracingPolicy that intercepts sys_symlinkat
and makes it return -1 every time the first argument is equal to the
string /etc/passwd.
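A sketch of such a policy is shown below. The argument types follow the symlinkat(2) signature; the exact encoding of string values in matchArgs is an assumption to verify against the CRD of your release.

```yaml
spec:
  kprobes:
  - call: "sys_symlinkat"
    syscall: true
    args:
    - index: 0
      type: "string"    # target path
    - index: 1
      type: "int"       # directory file descriptor
    - index: 2
      type: "string"    # link path
    selectors:
    - matchArgs:
      - index: 0
        operator: "Equal"
        values:
        - "/etc/passwd"
      matchActions:
      - action: Override
        argError: -1    # value returned instead of running the syscall
```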
Override uses the kernel error injection framework and is only available
on kernels compiled with CONFIG_BPF_KPROBE_OVERRIDE configuration option.
Overriding system calls is the primary use case, but there are other kernel
functions that support error injections too. These functions are annotated
with ALLOW_ERROR_INJECTION() in the kernel source, and can be identified by
reading the file /sys/kernel/debug/error_injection/list.
Starting from kernel version 5.7 overriding security_ hooks is also possible.
Caution For kernel developers: if you want to override your kernel functions then
ensure they properly follow the Error Injectable Functions guide.
FollowFD action
The FollowFD action allows creating a mapping, using a BPF map, between file
descriptors and filenames. After its creation, the mapping can be maintained
through UnfollowFD and CopyFD
actions. Note that proper maintenance of the mapping is up to the tracing policy
writer.
FollowFD is typically used at hook points where a file descriptor and its
associated filename appear together. The kernel function fd_install
is a good example.
The fd_install kernel function is called each time a file descriptor must be
installed into the file descriptor table of a process, typically referenced
within system calls like open or openat. It is a good place for tracking
file descriptor and filename matching.
This action uses the dedicated argFd and argName fields to get respectively
the index of the file descriptor argument and the index of the name argument in
the call.
While the mapping between the file descriptor and filename remains in place
(that is, between FollowFD and UnfollowFD for the same file descriptor)
tracing policies may refer to filenames instead of file descriptors. This
offers greater convenience and allows more functionality to reside inside the
kernel, thereby reducing overhead.
For instance, assume that you want to prevent writes into file
/etc/passwd. The system call sys_write only receives a file descriptor,
not a filename, as argument. Yet with a bracketing pair of FollowFD
and UnfollowFD actions in place the tracing policy that hooks into sys_write
can nevertheless refer to the filename /etc/passwd,
if it also marks the relevant argument as of type fd.
Combining the FollowFD and UnfollowFD actions with an argument of type fd
achieves this effect.
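The combination could be sketched as below: fd_install starts tracking, sys_write matches on the filename through the fd type, and sys_close stops tracking. The use of Sigkill as the preventive action is an illustrative choice, not taken from the original text.

```yaml
spec:
  kprobes:
  - call: "fd_install"
    syscall: false
    args:
    - index: 0
      type: "int"
    - index: 1
      type: "file"
    selectors:
    - matchArgs:
      - index: 1
        operator: "Equal"
        values:
        - "/etc/passwd"
      matchActions:
      - action: FollowFD     # start mapping the descriptor to the filename
        argFd: 0
        argName: 1
  - call: "sys_write"
    syscall: true
    args:
    - index: 0
      type: "fd"             # looked up in the FollowFD mapping
    - index: 1
      type: "char_buf"
      sizeArgIndex: 3
    - index: 2
      type: "size_t"
    selectors:
    - matchArgs:
      - index: 0
        operator: "Equal"
        values:
        - "/etc/passwd"
      matchActions:
      - action: Sigkill
  - call: "sys_close"
    syscall: true
    args:
    - index: 0
      type: "int"
    selectors:
    - matchActions:
      - action: UnfollowFD   # drop the mapping when the descriptor is closed
        argFd: 0
```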
UnfollowFD action
The UnfollowFD action takes a file descriptor from a system call and deletes
the corresponding entry from the BPF map, where it was put under the FollowFD
action.
It is typically used at hook points where the scope of association between
a file descriptor and a filename ends. The system call sys_close is a
good example.
Similar to the FollowFD action, the index of the file descriptor is described
under argFd:
matchActions:
- action: UnfollowFD
  argFd: 0
In this example, argFd is 0. So, the argument from the sys_close system
call at index: 0 will be deleted from the BPF map whenever a sys_close is
executed.
- index: 0
  type: "int"
Caution Whenever we would like to follow a file descriptor with a FollowFD block,
there should be a matching UnfollowFD block, otherwise the BPF map will be
broken.
CopyFD action
The CopyFD action is specific to duplication of file descriptor use cases.
Similarly to FollowFD, it takes argFd and argName arguments. It can
typically be used to track the dup, dup2 or dup3 syscalls.
GetUrl action
The GetUrl action can be used to perform a remote interaction such as
triggering Thinkst canaries or any system that can be triggered via a URL
request. It uses the argUrl field to specify the URL to request using the GET
method.
matchActions:
- action: GetUrl
  argUrl: http://ebpf.io
DnsLookup action
The DnsLookup action can be used to perform a remote interaction such as
triggering Thinkst canaries or any system that can be triggered via a DNS
entry request. It uses the argFqdn field to specify the domain to look up.
matchActions:
- action: DnsLookup
  argFqdn: ebpf.io
Post action
The Post action allows an event to be transmitted to the agent, from
kernelspace to userspace. By default, every TracingPolicy hook creates an
event with the Post action, except in the following situations:
a NoPost action was specified in a matchActions;
a rate-limiting parameter is in place, see details below.
Specifying this action explicitly allows you to set parameters for the Post action.
Rate limiting
Post takes the rateLimit parameter with a time value. This value defaults
to seconds, but appending ‘m’ or ‘h’ causes the value to be interpreted
in minutes or hours. When this parameter is specified for an action, that
action will check if the same action has fired, for the same thread, within
the time window, with the same inspected arguments. (Only the first 40 bytes
of each inspected argument are used in the matching. Only supported on kernels
v5.3 onwards.)
For example, you can specify a selector to only generate an event every 5
minutes by adding the following action and its parameter:
matchActions:
- action: Post
  rateLimit: 5m
By default, the rate limiting is applied per thread, meaning that only repeated
actions by the same thread will be rate limited. This can be expanded to all
threads for a process by specifying a rateLimitScope with value “process”; or
can be expanded to all processes by specifying the same with the value “global”.
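As a sketch, widening the rate limit from a single thread to the whole process could look like this; replace "process" with "global" to rate limit across all processes.

```yaml
matchActions:
- action: Post
  rateLimit: 5m               # at most one event every 5 minutes
  rateLimitScope: "process"   # shared by all threads of the process
```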
Stack traces
Post takes the kernelStackTrace parameter. When set to true (the default is
false), it enables dumping the kernel stack trace leading to the hook point in kprobe
events. To dump the user space stack trace, set the userStackTrace parameter to true.
For example, a kprobe hook on kfree_skb_reason, the function called in the kernel to drop
kernel socket buffers, can be used to retrieve the kernel stack leading to it.
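A hook of this kind might be sketched as follows. The policy name is a placeholder, and no arguments are declared because only the stack trace is of interest here.

```yaml
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "kernel-stack-trace"   # arbitrary placeholder name
spec:
  kprobes:
  - call: "kfree_skb_reason"
    selectors:
    - matchActions:
      - action: Post
        kernelStackTrace: true   # attach the kernel stack to each event
        userStackTrace: true     # optionally attach the user space stack too
```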
By default, Tetragon does not expose the linear addresses from kernel space or
user space; you need to enable the flag --expose-stack-addresses to get the
addresses along with the rest of the stack trace.
Note that the Tetragon agent uses its privileges to read the kernel symbols
and their addresses. Kernel symbol addresses can be used to break kernel
address space layout randomization (KASLR), so only privileged users
should be able to enable this feature and read events containing stack traces.
The same applies to retrieving addresses for user-mode processes:
stack trace addresses can be used to bypass address space layout randomization (ASLR).
Once loaded, events created from this policy will contain a new kernel_stack_trace
field on the process_kprobe event with an output similar to:
The “address” is the kernel function address, “offset” is the offset into the
native instruction for the function and “symbol” is the function symbol name.
The user-mode stack trace is contained in the user_stack_trace field on the
process_kprobe event and looks like:
The “address” is the function address, “offset” is the function offset from the
beginning of the binary module, “module” is the absolute path of the binary file
to which the address belongs, and “symbol” is the function symbol name. “symbol” may be missing
if the binary file is stripped.
Note
Information from procfs (/proc/<pid>/maps) is used to symbolize user
stack trace addresses. Stack trace address extraction and symbolization are asynchronous.
The process may have terminated, and its /proc/<pid>/maps file may no longer
exist, by the time the user stack trace is symbolized. In that case, user stack traces
for very short-lived processes might not be collected.
For Linux kernels before 5.15, user stack traces may be incomplete (some stack
trace entries may be missing).
This output can be rendered in a more human-friendly way using the tetra getevents -o compact
command. By default, it prints the stack trace along with the compact output of the event,
similarly to this:
The printing format for kernel stack trace is "0x%x: %s+0x%x", address, symbol, offset.
The printing format for user stack trace is "0x%x: %s (%s+0x%x)", address, symbol, module, offset.
Note Compact output will display missing addresses as 0x0, see the above note on
--expose-stack-addresses for more info.
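The two printing formats quoted above can be reproduced with Python's printf-style formatting. This is an illustrative sketch only: the helper names and the sample frame values are hypothetical, not output captured from Tetragon.

```python
def format_kernel_frame(address: int, symbol: str, offset: int) -> str:
    # Kernel frames: "0x%x: %s+0x%x", address, symbol, offset
    return "0x%x: %s+0x%x" % (address, symbol, offset)


def format_user_frame(address: int, symbol: str, module: str, offset: int) -> str:
    # User frames: "0x%x: %s (%s+0x%x)", address, symbol, module, offset
    return "0x%x: %s (%s+0x%x)" % (address, symbol, module, offset)


# Address 0 mimics a frame printed without --expose-stack-addresses.
print(format_kernel_frame(0, "kfree_skb_reason", 0x0))
print(format_user_frame(0x1234, "main", "/usr/bin/app", 0x10))
```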
NoPost action
The NoPost action suppresses the generation of the event, while all the
actions defined alongside it are still performed.
It’s useful when you are not interested in the event itself, just in the action
being performed.
The following example overrides the openat syscall for the “/etc/passwd” file but does not
generate any event about it.
TrackSock action
The TrackSock action creates a mapping between sockets and processes using a
BPF map. This mapping is state that must be maintained correctly; see the
related UntrackSock action. TrackSock
works similarly to FollowFD, specifying the argument with the sock type using
argSock instead of specifying the FD argument with argFd.
It is, however, more likely that socket tracking will be performed on the return
value of sk_alloc as described above.
Socket tracking is only available on kernel >=5.3.
UntrackSock action
The UntrackSock action takes a struct sock pointer from a function call and deletes
the corresponding entry from the BPF map, where it was put under the TrackSock
action.
Similar to the TrackSock action, the index of the sock is described under argSock:
- matchActions:
  - action: UntrackSock
    argSock: 0
In this example, argSock is 0. So, the argument from the __sk_free function
call at index: 0 will be deleted from the BPF map whenever a __sk_free is
executed.
- index: 0
  type: "sock"
Caution Whenever we would like to track a socket with a TrackSock block,
there should be a matching UntrackSock block, otherwise the BPF map will be
broken.
Socket tracking is only available on kernel >=5.3.
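To see why every TrackSock needs a matching UntrackSock, it can help to model the BPF map as a plain dictionary. The sketch below is a toy model, not Tetragon code: the SockTracker class and the socket addresses are hypothetical, and the comment about address reuse is an assumption about how a stale entry could become misleading.

```python
class SockTracker:
    """Toy stand-in for the BPF map that links sockets to processes."""

    def __init__(self):
        self.sock_to_process = {}

    def track(self, sock_addr: int, process: str) -> None:
        # Models TrackSock: remember which process owns this socket.
        self.sock_to_process[sock_addr] = process

    def untrack(self, sock_addr: int) -> None:
        # Models UntrackSock: drop the entry when the socket is freed.
        self.sock_to_process.pop(sock_addr, None)

    def lookup(self, sock_addr: int):
        return self.sock_to_process.get(sock_addr)


tracker = SockTracker()
tracker.track(0xFFFF0001, "curl")
tracker.untrack(0xFFFF0001)      # socket freed, entry removed
print(tracker.lookup(0xFFFF0001))

leaky = SockTracker()
leaky.track(0xFFFF0002, "curl")  # no matching untrack
# Assumption: if the same address is later reused for another socket,
# the stale entry still points at the old process.
print(leaky.lookup(0xFFFF0002))
```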
Notify Enforcer action
The NotifyEnforcer action notifies the enforcer program to kill or override a syscall.
It’s meant to be used on systems whose kernel lacks the multi kprobe feature (which
allows many kprobes to be attached quickly). To work around that, the enforcer sensor uses
the raw syscall tracepoint and attaches a simple program to the syscalls that we need to
kill or override.
The spec needs to contain an enforcer program definition, which instructs Tetragon to load
the enforcer program and attach it to the specified syscalls.
spec:
  enforcers:
  - calls:
    - "list:dups"
The calls field expects a list of syscalls or a list:XXX pointer to a list.
Note that currently only a single enforcer definition is allowed.
If specified, argError is passed to the bpf_override_return helper to override the syscall return value.
If specified, argSig is passed to the bpf_send_signal helper to send the signal to the process.
The following is a spec for killing the /usr/bin/bash program whenever it calls the sys_dup or sys_dup2 syscalls.
Note that, as mentioned above, NotifyEnforcer with the enforcer program is meant to be used only on kernel versions
with no support for fast attachment of multiple kprobes (kprobe_multi link).
With kprobe_multi link support, the above example can easily be replaced with:
The selector semantics of the CiliumTracingPolicy follow the standard
Kubernetes semantics and the principles that are used by Cilium to create a
unified policy definition.
To explain the structure and the logic behind it in more depth, let’s first consider
the following example:
In the YAML above, matchPIDs and matchArgs are logically ANDed together,
giving the expression:
(pid in {pid1, pid2, pid3} AND arg0=fdstring1)
Multiple values
When multiple values are given, we apply the OR operation between them. In
case of having multiple values under the matchPIDs selector, if any value
matches with the given pid from pid1, pid2 or pid3 then we accept the
event:
pid==pid1 OR pid==pid2 OR pid==pid3
As an example, we can filter for sys_read() syscalls that were not part of
the container initialization and the main pod process and tried to read from
the /etc/passwd file by using:
When multiple operators are specified under matchPIDs or matchArgs, they
are logically ANDed together. For example, with multiple operators under
matchPIDs:
Both Equal and NotEqual are set operations. This means that if multiple values
are specified, they are ORed together in the case of Equal, and ANDed together
in the case of NotEqual.
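The set semantics of Equal and NotEqual can be written as two one-line predicates. This is a minimal sketch with hypothetical function names, meant only to show the OR/AND combination of values described above:

```python
def match_equal(arg, values) -> bool:
    # Equal: values are ORed, so the argument must equal at least one of them.
    return any(arg == value for value in values)


def match_not_equal(arg, values) -> bool:
    # NotEqual: values are ANDed, so the argument must differ from all of them.
    return all(arg != value for value in values)


print(match_equal(1, [0, 1, 2]))      # True: 1 is in the set
print(match_not_equal(1, [0, 1, 2]))  # False: 1 equals one of the values
print(match_not_equal(5, [0, 1, 2]))  # True: 5 differs from every value
```

Put differently, Equal behaves like set membership and NotEqual like its negation.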
For example, in case of Equal the following YAML snippet matches if the
argument at index 0 is in the set of {arg0, arg1, arg2}.
The value can be specified as a hexadecimal (with 0x prefix), octal (with 0 prefix),
or decimal (no prefix) value.
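As an illustration of the three notations, the following Python sketch converts a value string to an integer according to its prefix. The parse_value helper is hypothetical and only mirrors the rule stated above; it does not handle negative numbers.

```python
def parse_value(text: str) -> int:
    """Parse a selector value given as hex (0x...), octal (0...) or decimal."""
    text = text.strip()
    if text.lower().startswith("0x"):
        return int(text, 16)
    if len(text) > 1 and text.startswith("0"):
        return int(text, 8)
    return int(text, 10)


print(parse_value("0x10"))  # hexadecimal
print(parse_value("010"))   # octal
print(parse_value("10"))    # decimal
```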
The Prefix operator checks whether the argument starts with the defined value,
while the Postfix operator checks whether the argument ends with the defined
value.
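In Python terms, these two operators correspond to str.startswith and str.endswith. The sketch below uses hypothetical helper names and assumes, as with the other operators, that multiple values are ORed together:

```python
def match_prefix(arg: str, values) -> bool:
    # Matches if the argument starts with any of the given values.
    return any(arg.startswith(value) for value in values)


def match_postfix(arg: str, values) -> bool:
    # Matches if the argument ends with any of the given values.
    return any(arg.endswith(value) for value in values)


print(match_prefix("/etc/passwd", ["/etc", "/boot"]))  # True
print(match_postfix("/etc/passwd", ["passwd"]))        # True
print(match_prefix("/tmp/file", ["/etc", "/boot"]))    # False
```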
The operators relating to ports, addresses and protocol are used with sock or skb
types. Port operators can accept a range of ports specified as min:max as well
as lists of individual ports. Address operators can accept IPv4/6 CIDR ranges as well
as lists of individual addresses.
The Protocol operator can accept integer values to match against, or the equivalent
IPPROTO_ enumeration. For example, UDP can be specified as either IPPROTO_UDP or 17;
TCP can be specified as either IPPROTO_TCP or 6.
The Family operator can accept integer values to match against or the equivalent
AF_ enumeration. For example, IPv4 can be specified as either AF_INET or 2; IPv6
can be specified as either AF_INET6 or 10.
The State operator can accept integer values to match against or the equivalent
TCP_ enumeration. For example, an established socket can be matched with
TCP_ESTABLISHED or 1; a closed socket with TCP_CLOSE or 7.
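The value formats accepted by the port, address, and enumeration operators can be sketched in Python as follows. The helper names are hypothetical, the lookup tables contain only the constants quoted above, and treating min:max as an inclusive range is an assumption.

```python
import ipaddress

# Only the constants mentioned in the text; real tables are larger.
PROTOCOLS = {"IPPROTO_TCP": 6, "IPPROTO_UDP": 17}
FAMILIES = {"AF_INET": 2, "AF_INET6": 10}
STATES = {"TCP_ESTABLISHED": 1, "TCP_CLOSE": 7}


def resolve(value: str, table: dict) -> int:
    """Accept either an enumeration name or its integer value."""
    return table[value] if value in table else int(value)


def port_matches(port: int, values) -> bool:
    """Values are individual ports ("80") or ranges ("8000:9000")."""
    for value in values:
        if ":" in value:
            low, high = (int(part) for part in value.split(":"))
            if low <= port <= high:  # assumption: both bounds inclusive
                return True
        elif port == int(value):
            return True
    return False


def addr_matches(addr: str, values) -> bool:
    """Values are individual addresses or IPv4/IPv6 CIDR ranges."""
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(value, strict=False) for value in values)


print(resolve("IPPROTO_UDP", PROTOCOLS), resolve("17", PROTOCOLS))
print(port_matches(8080, ["80", "8000:9000"]))
print(addr_matches("10.1.2.3", ["10.0.0.0/8", "192.168.1.1"]))
```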
In case of matchPIDs:
In
NotIn
The operator types In and NotIn are used to test whether the pid of a
system call is found in the provided values list in the CR. Both In and
NotIn are set operations, which means that if multiple values are specified, they
are ORed together in the case of In and ANDed together in the case of NotIn.
For example, in the case of In, the following YAML snippet matches if the pid of a
certain system call is in the list {0, 1}:
The In operator type is used to test whether the name of the binary that issued a system call
is found in the provided values list. For example, the following YAML snippet
matches if the binary name is in the list
{binary0, binary1, binary2}:
(pid in {pid1, pid2, pid3} AND arg0=1 AND arg2 < 500) OR (pid in {pid1, pid2, pid3} AND arg0=2)
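The expression above has a regular shape: filters inside one selector are ANDed, and separate selectors are ORed. The following Python sketch evaluates exactly that shape. The event layout, the concrete pids, and the function names are hypothetical and only serve to make the boolean structure concrete.

```python
def selector_matches(selector, event) -> bool:
    # All filters inside one selector must hold (logical AND).
    return all(predicate(event) for predicate in selector)


def policy_matches(selectors, event) -> bool:
    # The event is accepted if any selector matches (logical OR).
    return any(selector_matches(selector, event) for selector in selectors)


PIDS = {100, 200, 300}  # stands in for {pid1, pid2, pid3}

selectors = [
    [   # pid in PIDS AND arg0 == 1 AND arg2 < 500
        lambda e: e["pid"] in PIDS,
        lambda e: e["args"][0] == 1,
        lambda e: e["args"][2] < 500,
    ],
    [   # pid in PIDS AND arg0 == 2
        lambda e: e["pid"] in PIDS,
        lambda e: e["args"][0] == 2,
    ],
]

print(policy_matches(selectors, {"pid": 100, "args": [1, 0, 499]}))  # first selector
print(policy_matches(selectors, {"pid": 200, "args": [2, 0, 900]}))  # second selector
print(policy_matches(selectors, {"pid": 100, "args": [1, 0, 500]}))  # neither
```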
Limitations
These limitations might be outdated, see issue #709.
Because BPF must be bounded we have to place limits on how many selectors can
exist.
Max Selectors 8.
Max PID values per selector 4
Max MatchArgs per selector 5 (one per index)
Max MatchArg Values per MatchArgs 1 (limiting initial implementation can bump
to 16 or so)
Return Actions filter
Return action filters are a list of actions that execute when a return selector
matches. They are defined under matchReturnActions and currently support all
the action types of the Actions filter.
5 - Tags
Use Tags to categorize events
Tags are optional fields of a Tracing Policy that are used to categorize
generated events.
Introduction
Tags are specified in Tracing policies and will be part of the generated event.
apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: "file-monitoring-filtered"
spec:
  kprobes:
  - call: "security_file_permission"
    message: "Sensitive file system write operation"
    syscall: false
    args:
    - index: 0
      type: "file" # (struct file *) used for getting the path
    - index: 1
      type: "int" # 0x04 is MAY_READ, 0x02 is MAY_WRITE
    selectors:
    - matchArgs:
      - index: 0
        operator: "Prefix"
        values:
        - "/etc" # Writes to sensitive directories
        - "/boot"
        - "/lib"
        - "/lib64"
        - "/bin"
        - "/usr/lib"
        - "/usr/local/lib"
        - "/usr/local/sbin"
        - "/usr/local/bin"
        - "/usr/bin"
        - "/usr/sbin"
        - "/var/log" # Writes to logs
        - "/dev/log"
        - "/root/.ssh" # Writes to sensitive files add here.
      - index: 1
        operator: "Equal"
        values:
        - "2" # MAY_WRITE
    tags: ["observability.filesystem", "observability.process"]
Every kprobe call can have up to 16 tags.
Namespaces
Observability namespace
Events in this namespace relate to collecting and exporting data about the internal system state.
“observability.filesystem”: the event is about file system operations.
“observability.privilege_escalation”: the event is about raising permissions of a user or a process.
“observability.process”: the event is about an instance of a Linux program being executed.
User defined Tags
Users can define their own tags inside Tracing Policies. The officially supported tags are documented
in the Namespaces section.
6 - Kubernetes Identity Aware Policies
Tetragon in-kernel filtering based on Kubernetes namespaces, pod labels, and container fields
Motivation
Tetragon is configured via TracingPolicies. Broadly
speaking, TracingPolicies define what situations Tetragon should react to and how. The what
can be, for example, specific system calls with specific argument values. The how defines what
action the Tetragon agent should perform when the specified situation occurs. The most common action
is generating an event, but there are others (e.g., returning an error without executing the function
or killing the corresponding process).
Here, we discuss how to apply tracing policies only to a subset of the pods running on the system via
the following mechanisms:
namespaced policies
pod-label filters
container field filters
Tetragon implements these mechanisms in-kernel via eBPF. This is important for both observability
and enforcement use-cases.
For observability, copying only the relevant events from kernel- to user-space reduces overhead. For
enforcement, performing the enforcement action in the kernel avoids the race-condition of doing it
in user-space. For example, let us consider the case where we want to block an application from
performing a system call. Performing the filtering in-kernel means that the application will never
finish executing the system call, which is not possible if enforcement happens in user-space
(after the fact).
To ensure that namespaced tracing policies are always correctly applied, Tetragon needs to perform
actions before containers start executing. Tetragon supports this via OCI runtime
hooks. If
such hooks are not added, Tetragon will apply policies in a best-effort manner using information
from the k8s API server.
Namespace filtering
For namespace filtering we use TracingPolicyNamespaced which has the same contents as a
TracingPolicy, but it is defined in a specific namespace and it is only applied to pods of that
namespace.
Pod label filters
For pod label filters, we use the PodSelector field of tracing policies to select the pods that
the policy is applied to.
Container field filters
For container field filters, we use the containerSelector field of tracing policies to select the containers that the policy is applied to. At the moment, the only supported field is name.
Demo
Setup
For this demo, we use containerd and configure appropriate run-time hooks using minikube.
First, let us start minikube, build and load images, and install Tetragon and OCI hooks:
For illustration purposes, we will use the lseek system call with an invalid
argument: specifically, a file descriptor (the first argument) of -1. Normally,
this operation would return a “Bad file descriptor” error.
Let us start a pod in the default namespace:
kubectl -n default run test --image=python -it --rm --restart=Never -- python
The above command will result in the following Python shell:
If you don't see a command prompt, try pressing enter.
>>>
There is no policy installed, so attempting to do the lseek operation will just
return an error. Using the python shell, we can execute an lseek and see the
returned error.
>>> import os
>>> os.lseek(-1,0,0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor
>>>
In another terminal, we install a policy in the default namespace:
The above tracing policy will kill the process that performs an lseek system call with a file
descriptor of -1. Note that we use a SigKill action only for illustration purposes because it’s
easier to observe its effects.
Then, attempting the lseek operation in the previous terminal will result in the process getting
killed:
>>> os.lseek(-1, 0, 0)
pod "test" deleted
pod default/test terminated (Error)
The same is true for a newly started container:
kubectl -n default run test --image=python -it --rm --restart=Never -- python
If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
pod "test" deleted
pod default/test terminated (Error)
Doing the same on another namespace:
kubectl create namespace test
kubectl -n test run test --image=python -it --rm --restart=Never -- python
This will not kill the process; the call simply results in an error:
If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor
Pod label filters
Let’s install a tracing policy with a pod label filter.
kubectl run test --image=python -it --rm --restart=Never -- python
If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 9] Bad file descriptor
>>>
But pods with the label will:
kubectl run test --labels "app=lseek-test" --image=python -it --rm --restart=Never -- python
If you don't see a command prompt, try pressing enter.
>>> import os
>>> os.lseek(-1, 0, 0)
pod "test" deleted
pod default/test terminated (Error)
Container field filters
Let’s install a tracing policy with a container field filter.