The scenario for inducing this is as follows:
1. Start a container with a long stop timeout and a PID1 that
ignores SIGTERM
2. Use `podman stop` to stop that container
3. Simultaneously, in another terminal, kill -9 `pidof podman`
(the container is now in ContainerStateStopping)
4. Now kill that container's Conmon with SIGKILL.
5. No commands are able to move the container from Stopping to
Stopped now.
The cause is a bug in our exit-file handling logic: Conmon being dead
without having written an exit file caused no change to the state.
Add handling for this case that tries to clean up, including
stopping the container if it still seems to be running.
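A rough, self-contained sketch of that handling (function names, the exit-file path, and the use of SIGKILL are illustrative assumptions, not the actual libpod change):

```go
// Sketch only: decide how to move a container out of the Stopping state
// when conmon died without writing an exit file.
package main

import (
	"errors"
	"fmt"
	"os"
	"syscall"
)

// processAlive uses kill(pid, 0), which performs error checking only and
// delivers no signal; ESRCH means the process is gone.
func processAlive(pid int) bool {
	err := syscall.Kill(pid, 0)
	return err == nil || errors.Is(err, syscall.EPERM)
}

func reconcileStopping(conmonPID, ctrPID int, exitFile string) (string, error) {
	if processAlive(conmonPID) {
		return "stopping", nil // conmon will still write the exit file
	}
	if _, err := os.Stat(exitFile); err == nil {
		return "stopped", nil // exit file present, normal cleanup applies
	}
	// Conmon is dead and never wrote an exit file: stop the container
	// ourselves if it is still running, then mark it stopped.
	if ctrPID > 0 && processAlive(ctrPID) {
		if err := syscall.Kill(ctrPID, syscall.SIGKILL); err != nil && !errors.Is(err, syscall.ESRCH) {
			return "", fmt.Errorf("killing container process %d: %w", ctrPID, err)
		}
	}
	return "stopped", nil
}

func main() {
	state, err := reconcileStopping(12345, 12346, "/run/libpod/exits/deadbeef")
	fmt.Println(state, err)
}
```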
Fixes #19629
Addresses: https://issues.redhat.com/browse/ACCELFIX-250
Signed-off-by: Matt Heon <mheon@redhat.com>
Signed-off-by: tomsweeneyredhat <tsweeney@redhat.com>
Cherry pick from #20329
Addresses: https://issues.redhat.com/browse/RHEL-14744 and
https://issues.redhat.com/browse/RHEL-14743
When containers are created with named volumes, creation can deadlock
because the create logic locks all volumes in a loop. This is fine if
only a single container is ever created at a time. However, because
multiple containers can be created at the same time, they can deadlock
on the volumes. This is because the order of the loop is not stable; it
is based on the order in which the volumes were specified on the CLI.
So if you create two containers at the same time, one with
`-v vol1:/dir1 -v vol2:/dir2` and the other with
`-v vol2:/dir2 -v vol1:/dir1`, then there is a chance of a deadlock.
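For illustration only (not podman code), here is how the unstable lock order bites and how a deterministic, sorted acquisition order would avoid it; the commit argues below that the lock is not needed at all:

```go
// Two goroutines mimic two `podman create` calls that lock the same two
// volumes in opposite (CLI) order. Locking in sorted-name order instead
// makes the deadlock impossible.
package main

import (
	"sort"
	"sync"
)

var volumeLocks = map[string]*sync.Mutex{
	"vol1": {},
	"vol2": {},
}

// lockVolumes acquires the locks in sorted-name order so two concurrent
// callers can never each hold one lock while waiting for the other.
func lockVolumes(names []string) (unlock func()) {
	sorted := append([]string(nil), names...)
	sort.Strings(sorted)
	for _, n := range sorted {
		volumeLocks[n].Lock()
	}
	return func() {
		for i := len(sorted) - 1; i >= 0; i-- {
			volumeLocks[sorted[i]].Unlock()
		}
	}
}

func main() {
	var wg sync.WaitGroup
	for _, cliOrder := range [][]string{{"vol1", "vol2"}, {"vol2", "vol1"}} {
		wg.Add(1)
		go func(names []string) {
			defer wg.Done()
			unlock := lockVolumes(names) // safe regardless of CLI order
			defer unlock()
		}(cliOrder)
	}
	wg.Wait()
}
```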
Now, one solution could be to lock the volumes in a stable order to
prevent the issue, but the reason for holding the lock is dubious in the
first place. The goal was to prevent the volume from being removed in
the meantime. However, that could still have happened before we acquired
the lock, so the lock did not protect against it.
Both boltdb and sqlite already prevent us from adding a container with
volumes that do not exist, due to their internal consistency checks.
Sqlite even uses FOREIGN KEY relationships, so the schema will prevent us
from doing anything wrong.
The create code currently first checks if the volume exists and, if not,
creates it. I have checked that the db guarantees that adding a
container with a missing volume will not work:
Boltdb: `no volume with name test2 found in database when adding container xxx: no such volume`
Sqlite: `adding container volume test2 to database: FOREIGN KEY constraint failed`
Keep in mind that this error is normally not seen; only if the volume is
removed between the volume-exists check and adding the container to the
db will this message be seen, which is an acceptable race and a
pre-existing condition anyway.
[NO NEW TESTS NEEDED] Race condition, hard to test in CI.
Fixes #20313
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: tomsweeneyredhat <tsweeney@redhat.com>
The update to runc broke creation of devices for containers in
the pod cgroup. We don't support the device cgroup for pods at
present, so just disable it for now, resolving the issue.
Thanks to Giuseppe for finding the fix.
[NO NEW TESTS NEEDED] fixes a test break
Signed-off-by: Matt Heon <mheon@redhat.com>
Particularly, fix an issue with SELinux where image volumes would
not mount properly.
Based (loosely) on 2ec11b16abe26297331333ceb9aea3b908dba7b0 which
intermingled these changes with other fixes not relevant to this
backport.
Backported to v4.4.1-rhel per RHBZ 2213843
[NO NEW TESTS NEEDED]
Signed-off-by: Matt Heon <mheon@redhat.com>
When a userns is set we set up the network after the bind mounts; at the
point where resolv.conf is generated we do not yet know the subnet.
Just like the other DNS servers for bridge networks, we need to add the
IP later in completeNetworkSetup().
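A rough sketch of the idea (hypothetical helper name and paths, not the actual completeNetworkSetup() code): once the network backend has allocated the subnet, append the network's DNS server IP to the resolv.conf generated earlier.

```go
// appendNameserver adds a "nameserver <ip>" line to an existing
// resolv.conf; in the real code this would run after network setup,
// when the subnet (and thus the DNS IP) is finally known.
package main

import (
	"fmt"
	"os"
)

func appendNameserver(resolvPath, ip string) error {
	f, err := os.OpenFile(resolvPath, os.O_APPEND|os.O_WRONLY, 0o644)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "nameserver %s\n", ip)
	return err
}

func main() {
	if err := appendNameserver("/tmp/resolv.conf", "10.89.0.1"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```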
Addresses: https://bugzilla.redhat.com/show_bug.cgi?id=2182492 and
https://bugzilla.redhat.com/show_bug.cgi?id=2182491
This is targeted to RHEL 8.8 and 9.2 ZeroDay
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Signed-off-by: tomsweeneyredhat <tsweeney@redhat.com>
As described in #17777, the `restart` on-failure action did not behave
correctly when the health check is being run by a transient systemd
unit. It ran just fine when being executed outside such a unit, for
instance, manually or, as done in the system tests, in a scripted
fashion.
There were two issues causing the `restart` on-failure action to
misbehave:
1) The transient systemd units used the default `KillMode=cgroup` which
will nuke all processes in the specific cgroup including the recently
restarted container/conmon once the main `podman healthcheck run`
process exits.
2) Podman attempted to remove the transient systemd unit and timer
during restart. That is perfectly fine when manually restarting the
container but not when the restart itself is being executed inside
such a transient unit. Ultimately, Podman tried to shoot itself in
the foot.
Fix both issues by moving the restart logic into the cleanup process.
Instead of restarting the container, `healthcheck run` will just
stop the container, and the cleanup process will restart the container
once it has turned unhealthy.
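A minimal, self-contained illustration of that split (all names are made up, not the actual libpod code):

```go
// Inside `podman healthcheck run` we only stop an unhealthy container;
// restarting it there would tear down the transient unit's own cgroup
// (issue 1) and remove the unit currently executing us (issue 2), so the
// restart is deferred to the cleanup process running outside the unit.
package main

import "fmt"

type container struct{ state string }

func (c *container) stop()    { c.state = "stopped" }
func (c *container) restart() { c.state = "running" }

// handleUnhealthy runs inside the transient healthcheck unit.
func handleUnhealthy(c *container, onFailure string) {
	if onFailure == "restart" {
		c.stop() // do NOT restart from inside the transient unit
	}
}

// cleanup runs later, outside the transient unit, and does the restart.
func cleanup(c *container, onFailure string) {
	if onFailure == "restart" && c.state == "stopped" {
		c.restart()
	}
}

func main() {
	c := &container{state: "running"}
	handleUnhealthy(c, "restart")
	cleanup(c, "restart")
	fmt.Println(c.state) // running again, restarted by cleanup
}
```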
Backport of commit 95634154303f5b8c3d5c92820e2a3545c54f0bc8.
Fixes: #17777
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180125
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=2180126
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
* Utils must support a higher-level API to create a tar while chrooted
into a directory (see the sketch below).
* Volume export: use TarWithChroot instead of Tar so we can make sure no
symlink can be exported by tar if it exists outside of the source
directory.
* Container export: use chroot and Tar instead of Tar so we can make sure no
symlink can be exported by tar if it exists outside of the mountPoint.
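A hedged, simplified sketch of the chroot-then-tar idea (not the actual podman/containers-storage implementation; the real code re-execs a helper so the main process keeps its original root):

```go
// tarChrooted chroots into srcDir before archiving it, so a symlink
// pointing outside the directory cannot resolve to a host path.
// Requires root; handles only regular files, dirs, and symlinks.
package main

import (
	"archive/tar"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"syscall"
)

func tarChrooted(srcDir string, out io.Writer) error {
	if err := syscall.Chroot(srcDir); err != nil {
		return err
	}
	if err := os.Chdir("/"); err != nil {
		return err
	}
	tw := tar.NewWriter(out)
	defer tw.Close()
	return filepath.WalkDir("/", func(path string, d fs.DirEntry, err error) error {
		if err != nil || path == "/" {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		link := ""
		if info.Mode()&os.ModeSymlink != 0 {
			if link, err = os.Readlink(path); err != nil {
				return err
			}
		}
		hdr, err := tar.FileInfoHeader(info, link)
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(path[1:]) // strip leading "/"
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if info.Mode().IsRegular() {
			f, err := os.Open(path)
			if err != nil {
				return err
			}
			_, err = io.Copy(tw, f)
			f.Close()
			return err
		}
		return nil
	})
}

func main() {
	if len(os.Args) != 2 {
		os.Exit(2)
	}
	if err := tarChrooted(os.Args[1], os.Stdout); err != nil {
		os.Exit(1)
	}
}
```

Because the walk happens after the chroot, a symlink pointing at a host path such as /etc/passwd can only resolve inside the exported directory (or dangle) rather than escape it.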
[NO NEW TESTS NEEDED]
Race needs combination of external/in-container mechanism which is hard to repro in CI.
CVE: https://access.redhat.com/security/cve/CVE-2023-0778
Signed-off-by: Aditya R <arajan@redhat.com>
MH: Cherry-pick to v4.4.1-rhel per RHBZ 2169618
Signed-off-by: Matt Heon <mheon@redhat.com>
If the image being used has a user set that is an integer greater
than 0, then set the securityContext.runAsNonRoot to true for the
container in the generated kube yaml.
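A small sketch of the check this implies (illustrative, not the exact kube-generate code):

```go
// runAsNonRoot reports whether the image's configured user is a numeric
// UID greater than zero; only then can we assert non-root without
// resolving user names inside the image.
package main

import (
	"fmt"
	"strconv"
	"strings"
)

func runAsNonRoot(imageUser string) bool {
	// Image user may be "uid", "uid:gid", or a name.
	uidStr := strings.SplitN(imageUser, ":", 2)[0]
	uid, err := strconv.Atoi(uidStr)
	return err == nil && uid > 0
}

func main() {
	fmt.Println(runAsNonRoot("1000:1000"), runAsNonRoot("root"), runAsNonRoot("0"))
}
```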
Signed-off-by: Urvashi Mohnani <umohnani@redhat.com>
In the super rare case that there are two containers with the same ID
for two different users, podman logs with the journald driver would show
logs from both containers.
[NO NEW TESTS NEEDED] Impossible to reproduce.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
I noticed this while running some things in parallel: podman events
would show events from other users. Because all events are written to
the journal, everybody can see them. So when we read the journal we must
filter events for only the current UID.
To reproduce, run `podman events` as a user, then in another window create a
container as root, for example. After this patch it will correctly ignore
these events from other users.
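A hedged sketch of the filtering idea using the go-systemd journal bindings (the wrapper is illustrative, not the actual events backend):

```go
// openEventsJournal opens the journal and restricts reads to entries
// written by processes of the current UID, so another user's podman
// events are never returned.
package main

import (
	"fmt"
	"os"

	"github.com/coreos/go-systemd/v22/sdjournal"
)

func openEventsJournal() (*sdjournal.Journal, error) {
	j, err := sdjournal.NewJournal()
	if err != nil {
		return nil, err
	}
	// _UID is set by journald on every entry; match only our own.
	if err := j.AddMatch(fmt.Sprintf("_UID=%d", os.Getuid())); err != nil {
		j.Close()
		return nil, err
	}
	return j, nil
}

func main() {
	j, err := openEventsJournal()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer j.Close()
}
```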
[NO NEW TESTS NEEDED] I don't think we can test with two users at the same
time.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Loading container states speeds things up when listing all containers,
but it comes with a price tag for many other call paths. Hence, make
loading the state conditional to allow for keeping `podman ps` fast
without other commands regressing in performance.
[NO NEW TESTS NEEDED]
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Do not sync containers with the runtime and the database when listing
containers. It turns out to be extremely expensive and unnecessary.
The sync was needed since listing all containers from the database did
not populate their state. Doing that, however, is much faster since we
already have a connection to the database.
This change makes listing 200 containers 2 times faster than before.
[NO NEW TESTS NEEDED]
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Also do not return (and immediately suppress) an error if no health
check is defined for a given container.
Makes listing 100 containers around 10 percent faster.
[NO NEW TESTS NEEDED]
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
After https://github.com/containers/netavark/pull/452, `netavark` is
in charge of deciding `custom_dns_servers`, if any, so let's honor that;
libpod should not set these manually.
This also ensures docker parity.
Podman populates the container's `/etc/resolv.conf` with custom DNS servers (specified via `--dns` or `dns_server` in containers.conf)
even when the container is connected to a network where `dns_enabled` is `true`.
The current behavior does not match docker, hence the following commit ensures that podman only populates custom DNS servers when the container is not connected to any network where DNS is enabled; for the cases where `dns_enabled` is `true`,
resolution of the custom DNS servers will happen via `aardvark-dns` or `dnsname`.
Reference: https://docs.docker.com/config/containers/container-networking/#dns-services
Closes: containers#16172
Signed-off-by: Aditya R <arajan@redhat.com>
Aardvark-dns and netavark now accept custom DNS servers for containers
via the new config field `dns_servers`. The new field allows containers to use
custom resolvers instead of the host's default resolvers.
The following commit instruments libpod to pass these custom DNS servers, set
via `--dns` or the central config, to the network stack.
Depends-on:
* Common: containers/common#1189
* Netavark: containers/netavark#452
* Aardvark-dns: containers/aardvark-dns#240
Signed-off-by: Aditya R <arajan@redhat.com>
Kill is a fast syscall, so we can reduce the sleep time from 100ms to
10ms in the hope of speeding things up a bit.
[NO NEW TESTS NEEDED]
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Commit 067442b5701f improved stopping/killing a container by detecting
whether the cleanup process has already fired and changed the state of
the container. Further improve on that by returning early instead of
trying to wait for the PID to finish. At that point we know that the
container has exited but the previous PID may have been recycled
already by the kernel.
[NO NEW TESTS NEEDED] - the absence of the two flaking tests recorded
in #17142 will tell.
Fixes: #17142
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Add a comment when SIGKILL is being used. It may help future readers
better comprehend what's going on and why.
[NO NEW TESTS NEEDED] - cannot test a comment :^)
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Move the stopSignal decl into the branch where it's actually used.
[NO NEW TESTS NEEDED] as it's just a small refactor.
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
The code can be simplified by using a timer directly.
[NO NEW TESTS NEEDED] - should not change behavior.
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Reserved annotations are used internally by Podman and would affect
nothing when run with Kubernetes, so we should not be generating these
annotations.
Fixes: https://github.com/containers/podman/issues/17105
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
The container lock is released before stopping/killing which implies
certain race conditions with, for instance, the cleanup process changing
the container state to stopped, exited or other states.
The (remaining) flakes seen in #16142 and #15367 strongly indicate a
race between stopping/killing a container and the cleanup
process. To fix the flake, make sure to ignore invalid-state errors.
An alternative fix would be to change `KillContainer` to not return such
errors at all, but commit c77691f06f61 indicates an explicit desire to
have these errors reported in the sig proxy.
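A tiny, self-contained illustration of the chosen fix (the error value is a stand-in, not libpod's actual define errors):

```go
// forwardSignal treats "container is in the wrong state" errors as
// non-fatal: the cleanup process may have stopped or removed the
// container between our state check and the kill, which is not a
// failure of signal proxying.
package main

import (
	"errors"
	"fmt"
)

var errCtrStateInvalid = errors.New("container state improper")

func forwardSignal(kill func() error) error {
	if err := kill(); err != nil {
		if errors.Is(err, errCtrStateInvalid) {
			return nil // raced with cleanup, ignore
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println(forwardSignal(func() error { return errCtrStateInvalid }))
}
```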
[NO NEW TESTS NEEDED] as it's a race already covered by the system
tests.
Fixes: #16142
Fixes: #15367
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Every time I look at a container-removal issue I wonder why the
container isn't locked directly here, so let's add a comment.
I am not sure whether it would be better if callers took care of
locking, but for now the comment will save the future me and probably
other readers some time.
[NO NEW TESTS NEEDED]
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
This is a cleaner solution and guarantees the variables
will not be used before they are initialized.
[NO NEW TESTS NEEDED]
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
The StoppedByUser variable indicates that the container was
requested to stop by a user. It's used to prevent restart policy
from firing (so that a restart=always container won't restart if
the user does a `podman stop`). The problem is we were setting it
*very* late in the stop() function. Originally, this was fine,
but after the changes to add the new Stopping state, the logic
that triggered restart policy was firing before StoppedByUser was
even set - so the container would still restart.
Setting it earlier shouldn't hurt anything and guarantees that
checks will see that the container was stopped manually.
Fixes #17069
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
While manually playing with --service-container, I encountered a number
of too verbose logs. For instance, there's no need to error-log when
the service-container has already been stopped.
For testing, add a new kube test with a multi-pod YAML which will
implicitly show that #17024 is now working.
Fixes: #17024
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
Every podman command is paying the price for this compile even when it
doesn't use the regex; this will speed up podman startup a little.
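An illustrative sketch of the lazy-compilation pattern (made-up names, not the actual podman variables):

```go
// Compile the regex on first use instead of at package init, so podman
// invocations that never validate a name do not pay for the compile.
package main

import (
	"fmt"
	"regexp"
	"sync"
)

var (
	nameRegexOnce sync.Once
	nameRegex     *regexp.Regexp
)

func validName(s string) bool {
	nameRegexOnce.Do(func() {
		nameRegex = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9_.-]*$`)
	})
	return nameRegex.MatchString(s)
}

func main() {
	fmt.Println(validName("my-container"), validName("-bad"))
}
```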
[NO NEW TESTS NEEDED] Existing tests should catch issues.
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Follow-up to 6886e80b45caae27dda81a9b44d8dd179c414580:
when "podman rm -f" is used on a container in "stopping" state, also
make sure it is terminated before removing it from the local storage.
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
check that the container has a valid pid before attempting to use
kill($PID, 0) on it. If the PID==0, it means the container is already
stopped.
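A minimal sketch of that guard (illustrative, not the actual libpod code):

```go
// containerRunning: a recorded PID of 0 means the container already
// exited; probing with kill(0, 0) would target the caller's own process
// group and report success, making a dead container look alive.
package main

import (
	"errors"
	"fmt"
	"syscall"
)

func containerRunning(pid int) bool {
	if pid <= 0 {
		return false
	}
	// Signal 0 performs existence/permission checks only.
	err := syscall.Kill(pid, 0)
	return err == nil || errors.Is(err, syscall.EPERM)
}

func main() {
	fmt.Println(containerRunning(0), containerRunning(1))
}
```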
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Do not allow removing the service container unless all associated
pods have been removed. Previously, the service container could be
removed when all pods had exited, which can lead to a number of issues.
Now, the service container is treated like an infra container and can
only be removed along with the pods.
Also make sure that a pod is unlinked from the service container once
it's being removed.
Fixes: #16964
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
If the container has no PID namespace, the exec session processes are
not killed when the container process ends. In this case, attempt to
kill them in the same way.
The problem was noticed with toolbox where the exec'ed sessions are
not terminated when the container is stopped, blocking the system
shutdown.
[NO NEW TESTS NEEDED]
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
in several top-level API functions. Each is the first line of
the function that contains it, which makes sense; we want to
capture any error returned by the function. However, making this
the first defer means that it is the last thing to run after the
function returns - meaning that the container's
`defer c.lock.Unlock()` has already fired, leading to a chance we
modify the container without holding its lock.
We could move the function around so it's no longer the first
defer, but then we'd have to call it twice (immediately after
`defer c.lock.Unlock()` if the container is not batched, and a
second time in a new `else` block right after the lock/sync call
to make sure we handle batched containers). Seems simpler to just
leave it like this.
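A self-contained illustration of the defer-ordering pitfall (not podman code):

```go
// Deferred calls run last-in-first-out: a "record the error" defer
// registered before the unlock defer runs *after* the lock has already
// been released, so it touches shared state without the lock held.
package main

import (
	"fmt"
	"sync"
)

var mu sync.Mutex

func apiCall() (err error) {
	defer func() {
		// Registered first, therefore runs last: at this point the
		// mutex below has already been unlocked.
		fmt.Println("saving error state without the lock:", err)
	}()

	mu.Lock()
	defer mu.Unlock() // registered second, runs first

	return fmt.Errorf("something failed")
}

func main() {
	_ = apiCall()
}
```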
[NO NEW TESTS NEEDED] Can't really test for DB corruption easily.
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
When you use podman logs with --until and --follow it should exit after
the requested until time and not keep hanging forever.
This fixes the behavior for the k8s-file backend.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
When you use podman logs with --until and --follow it should exit after
the requested until time and not keep hanging forever.
To make this work I reworked the code to use the better journald event
reading code for logs as well. This correctly uses the sd_journal API
without having to compare the cursors to find the EOF.
The same problem exists for the k8s-file driver; I will fix this in the
next commit.
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
Instead of reading the full journal, which can be expensive, we can seek
based on the time.
If you have a journald with many podman events just compare the time
`time podman events --since 1s --stream=false` with and without this
patch.
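A hedged sketch of the seek using the go-systemd bindings (the helper is illustrative, not the actual podman events code):

```go
// seekSince jumps the journal cursor directly to the --since timestamp
// instead of iterating every entry from the beginning.
package main

import (
	"fmt"
	"os"
	"time"

	"github.com/coreos/go-systemd/v22/sdjournal"
)

func seekSince(j *sdjournal.Journal, since time.Time) error {
	// The journal stores realtime timestamps in microseconds.
	return j.SeekRealtimeUsec(uint64(since.UnixNano() / int64(time.Microsecond)))
}

func main() {
	j, err := sdjournal.NewJournal()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer j.Close()
	if err := seekSince(j, time.Now().Add(-time.Second)); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```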
Signed-off-by: Paul Holzinger <pholzing@redhat.com>
The `containerCouldBeLogging` bool should not be false by default: when
--since is used we seek in the journal and can miss the start event, so
that bool would stay false forever. This means that a running container
is not followed even when it should be.
To fix this we can just set the `containerCouldBeLogging` bool based on
the current container state.
Fixes #16950
Signed-off-by: Paul Holzinger <pholzing@redhat.com>