27 Commits

Author SHA1 Message Date
cf5df5b805 quadlet tests: skip on RHEL8 rootless
skip in setup() if journald unavailable.

To be pedantic, this is overkill: some quadlet tests pass
because they don't run journald. Too bad.

Also skip a play-kube test that requires journal

Signed-off-by: Ed Santiago <santiago@redhat.com>
2023-03-21 07:18:14 -06:00
1541ce56cf kube play: set service container as main PID when possible
Commit 4fa307f14923 fixed a number of issues in the sdnotify proxies.
Whenever a container runs with a custom sdnotify policy, the proxies
need to keep running which in turn required Podman to run and wait for
the service container to stop.  Improve on that behavior and set the
service container as the main PID (instead of Podman) when no container
needs sdnotify.

Fixes: #17345
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
2023-02-10 13:31:03 +01:00
2dd9e0859c Merge pull request #16947 from ygalblum/kube-service-container-logdriver
Kube Play: use passthrough as the default log-driver if service-container is set
2023-01-03 09:28:00 -05:00
68fbebfacc Kube Play: use passthrough as the default log-driver if service-container is set
Reasoning
---------
When the log-driver is passthrough, the journal socket is passed to the containers as-is which has two advantages:
1. journald can see who the actual sender of the log event is,
    rather than thinking everything comes from the conmon process
2. conmon will not have to copy all the log data

Code Changes
------------
If log-driver was not set by the user and service-container is set use
passthrough as the default log-driver

Update the system tests
- explicitly set logdriver in sdnotify and play tests
- podman-kube template test:  Verify the default log driver for service-container

Signed-off-by: Ygal Blum <ygal.blum@gmail.com>
2023-01-03 10:34:24 +02:00
16b595c32c Build and use a newer systemd image
...based on f37, not f31. And make it fedora-minimal so it's
smaller. And clean up dnf so it's even smaller. And tag it
with our proper YMD tag, and commit the script that builds it.

This broke the system-df tests. In the process of resolving
that, I found those tests a little lacking. So, improve their
coverage a little bit.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2023-01-02 13:26:46 -07:00
4fa307f149 kube sdnotify: run proxies for the lifespan of the service
As outlined in #16076, a subsequent BARRIER *may* follow the READY
message sent by a container.  To correctly imitate the behavior of
systemd's NOTIFY_SOCKET, the notify proxies span up by `kube play` must
hence process messages for the entirety of the workload.

We know that the workload is done and that all containers and pods have
exited when the service container exits.  Hence, all proxies are closed
at that time.

The above changes imply that Podman runs for the entirety of the
workload and will henceforth act as the MAINPID when running inside of
systemd.  Prior to this change, the service container acted as the
MAINPID which is now not possible anymore; Podman would be killed
immediately on exit of the service container and could not clean up.

The kube template now correctly transitions to in-active instead of
failed in systemd.

Fixes: #16076
Fixes: #16515
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
2022-12-06 14:15:11 +01:00
8c3af71862 notify k8s system test: move sending message into exec
The flake in #16076 is likely related to the notify message not being
delivered/read correctly.  Move sending the message into an exec session
such that flakes will reveal an error message.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
2022-12-05 14:32:06 +01:00
120a77e394 testimage: add iproute2 & socat, for pasta networking
PR #16141 introduces a new network type, "pasta". Its tests
rely on running 'ip -j' and socat in the container. Add them.

Also: bump to alpine 3.16.2 (from 3.16.0)
Also: clean up apk cache, this saves us 2MB+ in the image

Also (unrelated): clean up two broken uses of '$(< ...)' that
are causing tests to blow up under bats 1.8 on my laptop

New testimage is 20221018 and, sigh, is 12.7MB (up 4MB).

Signed-off-by: Ed Santiago <santiago@redhat.com>
2022-10-18 11:50:48 -06:00
d5f044ee7a System tests: reenable some skipped aarch64 tests
Background: in order to add aarch64 tests, we had to add
emergency skips to a lot of failing tests. No attempt was
ever made to understand why they were failing.

Fast forward to today, I filed #15888 just to see if tests
are still failing. Looks like a number of them are fixed.
(Yes, magically). Remove those skips.

See: #15074, #15277

Signed-off-by: Ed Santiago <santiago@redhat.com>
2022-09-21 14:07:22 -06:00
79e21b5b16 kube play: sd-notify integration
Integrate sd-notify policies into `kube play`.  The policies can be
configured for all contianers via the `io.containers.sdnotify`
annotation or for indidivual containers via the
`io.containers.sdnotify/$name` annotation.

The `kube play` process will wait for all containers to be ready by
waiting for the individual `READY=1` messages which are received via
the `pkg/systemd/notifyproxy` proxy mechanism.

Also update the simple "container" sd-notify test as it did not fully
test the expected behavior which became obvious when adding the new
tests.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
2022-08-10 21:12:39 +02:00
3fc126e152 libpod: allow the notify socket to be passed programatically
The notify socket can now either be specified via an environment
variable or programatically (where the env is ignored).  The
notify mode and the socket are now also displayed in `container inspect`
which comes in handy for debugging and allows for propper testing.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
2022-08-10 21:10:17 +02:00
da98c88778 Cirrus: enable Fedora 36 aarch64 tasks on EC2
new file:   test/e2e/config_arm64.go

Tests that fail on aarch64 have been skipped with
`skip_if_aarch64`.

Co-authored-by: Chris Evich <cevich@redhat.com>
Co-authored-by: Ed Santiago <santiago@redhat.com>
Signed-off-by: Lokesh Mandvekar <lsm5@fedoraproject.org>
2022-07-27 15:27:52 -04:00
8684d41e38 k8systemd: run k8s workloads in systemd
Support running `podman play kube` in systemd by exploiting the
previously added "service containers".  During `play kube`, a service
container is started before all the pods and containers, and is stopped
last.  The service container communicates its conmon PID via sdnotify.

Add a new systemd template to dispatch such k8s workloads.  The argument
of the template is the path to the k8s file.  Note that the path must be
escaped for systemd not to bark:

Let's assume we have a `top.yaml` file in the home directory:
```
$ escaped=$(systemd-escape ~/top.yaml)
$ systemctl --user start podman-play-kube@$escaped.service
```

Closes: https://issues.redhat.com/browse/RUN-1287
Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
2022-05-17 10:18:58 +02:00
03af8213ce sdnotify: send MAINPID only once
Send the main PID only once.  Previously, `(*Container).start()` and
the conmon handler sent them ~simultaneously and went into a race.

I noticed the issue while debugging a WIP PR.

Signed-off-by: Valentin Rothberg <vrothberg@redhat.com>
2022-05-12 11:11:37 +02:00
cc42321697 sdnotify test: accept MAINPID anywhere
systemd sometimes spits out lines in the wrong order. Deal with it.

This fixes an infrequent flake that I haven't filed because I
didn't understand it well enough. (Hence, this reduces BUGS
but does not reduce BUG COUNT. Sorry!)

Signed-off-by: Ed Santiago <santiago@redhat.com>
2021-09-30 12:09:48 -06:00
e3c7e02a0e System tests: add cleanup & debugging output
Cleanup: the final 'play' test wasn't cleaning up after itself,
leading to angry warning messages when rerunning tests (in
my environment; never in CI)

Debug: I'm seeing a lot of "Could not parse READY=1 as MAINPID=nnn"
flakes in the sdnotify:container test (nine in the past month). Add
debug traces to help diagnose in future flakes.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2021-09-01 11:29:59 -06:00
c22f3e8b4e Implement SD-NOTIFY proxy in conmon
This leverages conmon's ability to proxy the SD-NOTIFY socket.
This prevents locking caused by OCI runtime blocking, waiting for
SD-NOTIFY messages, and instead passes the messages directly up
to the host.

NOTE: Also re-enable the auto-update tests which has been disabled due
to flakiness.  With this change, Podman properly integrates into
systemd.

Fixes: #7316
Signed-off-by: Joseph Gooch <mrwizard@dok.org>
Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
Signed-off-by: Valentin Rothberg <rothberg@redhat.com>
2021-08-20 11:12:05 +02:00
9fd7ab50f8 System tests: honor $OCI_RUNTIME (for CI)
Some CI systems set $OCI_RUNTIME as a way to override the
default crun. Integration (e2e) tests honor this, but system
tests were not aware of the convention; this means we haven't
been testing system tests with runc, which means RHEL gating
tests are now failing.

The proper solution would be to edit containers.conf on CI
systems. Sorry, that would involve too much CI-VM work.
Instead, this PR detects $OCI_RUNTIME and creates a dummy
containers.conf file using that runtime.

Add: various skips for tests that don't work with runc.

Refactor: add a helper function so we don't need to do
the complicated 'podman info blah blah .OCIRuntime.blah'
thing in many places.

BUG: we leave a tmp file behind on exit.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2021-05-03 20:15:21 -06:00
660a72993c sdnotify tests: try real hard to kill socat processes
podman gating tests are hanging in the new Fedora CI setup;
long and tedious investigation suggests that 'socat' processes
are being left unkilled, which then causes BATS to hang when
it (presumably) runs a final 'wait' in its end cleanup.

The two principal changes are to exec socat in a subshell
with fd3 closed, and to pkill its child processes before
killing the process itself. I don't know if both are needed.
The pkill definitely is; the exec may just be superstition.
Since I've wasted more than a day of PTO time on this, I'm
okay with a little superstition. What I do know is that with
these two changes, my reproducer fails to reproduce in over
one hour of trying (normally it fails within 5 minutes).

AND, update: only rawhide (f35) leaves stray socat processes
behind. f33 and ubuntu do not, so 'pkill -P' fails.

I really have no idea what's going on.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2021-03-11 16:21:51 -07:00
1345d0358b system tests: the catch-up game
- run test: minor cleanup to .containerenv test. Basically,
  make it do only two podman-runs (they're expensive) and
  tighten up the results checks

- ps test: add ps -a --storage. Requires small tweak to
  run_podman helper, so we can have "timeout" be an expected
  result

- sdnotify test: workaround for #8718 (seeing MAINPID=xxx as
  last output line instead of READY=1). As found by the
  newly-added debugging echos, what we are seeing is:

      MAINPID=103530
      READY=1
      MAINPID=103530

  It's not supposed to be that way; it's supposed to be just
  the first two. But when faced with reality, we must bend
  to accommodate it, so let's accept READY=1 anywhere in
  the output stream, not just as the last line.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-12-14 15:06:43 -07:00
f5b3dc976c Tests: Fix common flakes, and improve apiv2 test log
- apiv2 - the 'ten /info requests' test is flaking often,
  taking ~8 seconds (our limit is 7, up from 5 a few weeks
  ago). Brent suggested that the first /info call might be
  expensive, because it needs to access storage. So, let's
  prime it by running one /info outside the timing loop.
  And, because even that continues to fail, bump it up
  to 10 seconds and file #8076 to track the slowdown.

- toolbox test - WaitForReady() has timed out, even on one
  occasion causing a run failure because it failed 3 times.
  Solution: bump up timeout from 2s to 5s. Not really great,
  but CI systems are underpowered, and it's not unreasonable
  that 2s might be too low.

- sdnotify test - add a 'podman wait' between stop & rm.
  This may prevent a "cannot rm container as it is running"
  race condition.

While working on this, Brent and I noticed a few ways that
test-apiv2 logging can be improved:

- test name: when request is POST, display the jsonified
  parameters, not the original input ones. This should
  make it much easier to reproduce failures.

- use curl's "--write-out" option to capture http code,
  content type, and request time. We were getting the
  first two via grep from logged headers; this is cleaner.
  And there was no other way to get timing. We now include
  the timing as X-Response-Time in the log file.

- abort on *any* curl error, not just 7 (cannot connect).
  Any error at all from curl is bad news.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-10-20 11:32:49 -06:00
1646da834c System test additions
- run --userns=keep-id: confirm that $HOME gets set (#8013)

 - inspect: confirm that JSON output is a sane number of
   lines (10 or more), not an unreadable one-liner (#8011
   and #8021). Do so with image, pod, network, volume
   because the code paths might be different.

 - cgroups: confirm that 'run' preserves cgroup manager (#7970)

 - sdnotify: reenable tests, and hope CI doesn't hang. This
   test was disabled on August 18 because CI jobs were hanging
   and timing out. My suspicion was that it was #7316, which
   in turn seems to have hinged on conmon #182. The latter
   was merged on Sep 16, so let's cross our fingers and see
   what happens.

Also: remove inaccurate warning from a networking test.

And, wow, fix is_cgroupsv2(), it has never actually worked.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-10-14 15:32:02 -06:00
a9dbd2b3de Migrate away from docker.io
CI and system tests currently pull some images from docker.io.
Eliminate that, by:

  - building a custom image containing much of what we need
    for testing; and
  - copying other needed images to quay.io

(Reason: effective 2020-11-01 docker.io will limit the
number of image pulls).

The principal change is to create a new quay.io/libpod/testimage,
using the new test/system/build-testimage script, instead of
relying on quay.io/libpod/alpine_labels. We also switch to
using a hardcoded :YYYYMMDD tag, instead of :latest, in an
attempt to futureproof our CI. This image includes 'httpd'
from busybox-extras, which we use in our networking test
(previously we had to pull and run busybox from docker.io).

The testimage can and should be extended as needed for future
tests, e.g. adding test file content or other useful tools.

For the '--pull' tests which require actually pulling from
the registry, I've created an image with the same name but
tagged :00000000 so it will never be pulled by default.
Since this image is only used minimally, it's just busybox.

Unfortunately there remain two cases we cannot solve in
this tiny alpine-based image:

  1) docker registry
  2) systemd

For those, I've (manually) run:

    podman pull [ docker.io/library/registry:2.7 | registry.fedoraproject.org/fedora:31 ]
    podman tag !$ quay.io/...
    podman push !$

...and amended the calling tests accordingly.

I've tried to make the the smallest reasonable diff, not the
smallest possible one. I hope it's a reasonable tradeoff.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-09-08 06:06:06 -06:00
d254fa4c35 system tests: enable more remote tests; cleanup
info, images, run, networking tests: remove some skip_if_remote()s
that were added in the varlink days. All of these tests now seem
to work with APIv2.

help test: check that first output line from 'podman --help'
is the program description (regression check for #7273).

load test: clean up stray images, rewrite test to make it conform
to existing convention. In the process, discover and file #7337

exec test (and networking): file #7360, and add FIXME comment
to skip()s suggesting evaluating those tests once that is fixed.

pod test: now that #6328 is fixed, use 'podman pod inspect --format'
instead of relying on jq

Various other tests: add an explanation of why test is disabled
so we can more easily distinguish "this will never be meaningful
under remote" vs "hey, doesn't work for now, but maybe someday".

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-08-19 08:12:14 -06:00
18f36d8cf6 Re-disable sdnotify tests to try to fix CI
Some CI tests are hanging, timing out in 60 or 120 minutes.
I wonder if it's #7316, the bug where all podman commands
hang forever if NOTIFY_SOCKET is set?

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-08-18 07:21:47 -06:00
60ab5f3ae6 system tests: enable sdnotify tests
Oops. PR #6693 (sdnotify) added tests, but they were disabled
due to broken crun on f31. I tried for three weeks to get a
magic CI:IMG PR to update crun on the CI VMs ... but in that
time I forgot to actually enable those new tests.

This PR removes a 'skip', replacing it with a check that systemd
is running plus one more to make sure our runtime is crun. It
looks like sdnotify just doesn't work on Ubuntu (it hangs), and
my guess is that it's a crun/runc issue.

I also changed the test image from fedora:latest to :31, because,
sigh, fedora:latest removed the systemd-notify tool.

WARNING WARNING WARNING: the symptom of a missing systemd-notify
is that podman will hang forever, not even stopped by the timeout
command in podman_run! (Filed: #7316). This means that if the
sdnotify-in-container test ever fails, the symptom will be that
Cirrus itself will time out (2 hours?). This is horrible. I
don't know what to do about it other than push for a fix for 7316.

Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-08-13 19:16:25 -06:00
10ad46eb73 BATS system tests for new sdnotify
Signed-off-by: Ed Santiago <santiago@redhat.com>
2020-07-06 17:47:22 +00:00