Commit Graph

1409 Commits

Author SHA1 Message Date
Matthew Heon
ebacfbd091 podman: fix memleak caused by renaming and not deleting
the exit file

If the container exit code needs to be retained, it cannot be retained
in tmpfs, because libpod runs in a memcg itself so it can't leave
traces with a daemon-less design.

This wasn't a memleak detectable by kmemleak for example. The kernel
never lost track of the memory and there was no erroneous refcounting
either. The reference count dependencies however are not easy to track
because when a refcount is increased, there's no way to tell who's
still holding the reference. In this case it was a single page of
tmpfs pagecache holding a refcount that kept pinned a whole hierarchy
of dying memcg, slab kmem, cgroups, unreachable kernfs nodes and the
respective dentries and inodes. Such a problem wouldn't happen if the
exit file was stored in a regular filesystem because the pagecache
could be reclaimed in such case under memory pressure. The tmpfs page
can be swapped out, but that's not enough to release the memcg with
CONFIG_MEMCG_SWAP_ENABLED=y.

No amount of more aggressive kernel slab shrinking could have solved
this. Not even assigning slab kmem of dying cgroups to alive cgroup
would fully solve this. The only way to free the memory of a dying
cgroup when a struct page still references it, would be to loop over
all "struct page" in the kernel to find which one is associated with
the dying cgroup, which is an O(N) operation (where N is the number of
pages and can reach billions). Linking all the tmpfs pages to the
memcg would cost less during memcg offlining, but it would waste lots
of memory and CPU globally. So this can't be optimized in the kernel.

A cronjob running this command can act as a workaround and will allow
all slab cache to be released, not just the single tmpfs pages.

    rm -f /run/libpod/exits/*
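
For readers who would rather run the cleanup from a Go helper than from a shell one-liner, the following is a minimal sketch of the same idea. The function names `removeExitFiles` and `demo` are hypothetical and are not part of libpod; the demo runs against a temporary directory rather than the real /run/libpod/exits. Unlike the shell glob, the sketch also removes dot-files.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeExitFiles deletes every non-directory entry directly inside dir and
// returns how many it removed. An entry that vanished in the meantime is
// ignored, in the spirit of `rm -f`.
func removeExitFiles(dir string) (int, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	removed := 0
	for _, entry := range entries {
		if entry.IsDir() {
			continue
		}
		err := os.Remove(filepath.Join(dir, entry.Name()))
		switch {
		case err == nil:
			removed++
		case os.IsNotExist(err):
			// Already gone; nothing to do.
		default:
			return removed, err
		}
	}
	return removed, nil
}

// demo exercises removeExitFiles on a throwaway directory so that the sketch
// never touches a real exit-file directory.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "exits-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"ctr1", "ctr2"} {
		path := filepath.Join(dir, name)
		if err := os.WriteFile(path, []byte("0"), 0o600); err != nil {
			return 0, err
		}
	}
	return removeExitFiles(dir)
}

func main() {
	n, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("removed", n, "exit files")
}
```
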

This patch solved the memleak with a reproducer, booting with
cgroup.memory=nokmem and with selinux disabled. The reason memcg kmem
and selinux were disabled for testing of this fix, is because kmem
greatly decreases the kernel effectiveness in reusing partial slab
objects. cgroup.memory=nokmem is strongly recommended at least for
workstation usage. selinux needs to be further analyzed because it
causes further slab allocations.

The upstream podman commit used for testing is
1fe2965e4f (v1.4.4).

The upstream kernel commit used for testing is
f16fea666898dbdd7812ce94068c76da3e3fcf1e (v5.2-rc6).

Reported-by: Michele Baldessari <michele@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>

<Applied with small tweaks to comments>
Signed-off-by: Matthew Heon <matthew.heon@pm.me>
2019-07-31 17:28:42 -04:00
OpenShift Merge Robot
680a383874 Merge pull request #3672 from petejohanson/32bit-build-fixes
Build fix for 32-bit systems.
2019-07-30 22:07:32 +02:00
Pete Johanson
32aaf8da56 Build fix for 32-bit systems.
* Fixes #3664.

Signed-off-by: Pete Johanson <peter@peterjohanson.com>
2019-07-30 12:25:36 -04:00
OpenShift Merge Robot
1a008958d4 Merge pull request #3661 from openSUSE/nixos-friendly-config
Update libpod.conf to be more friendly to NixOS
2019-07-30 16:33:48 +02:00
Sascha Grunert
52ae51c79f Update libpod.conf to be NixOS friendly
NixOS links the current system state to `/run/current-system`, so we
have to add these paths to the configuration files as well to work out
of the box.

Signed-off-by: Sascha Grunert <sgrunert@suse.com>
2019-07-30 12:59:11 +02:00
OpenShift Merge Robot
7d635ac1c5 Merge pull request #3656 from jwhonce/wip/env
Fix commit --changes env=X=Y
2019-07-29 21:57:08 +02:00
OpenShift Merge Robot
6665269ab8 Merge pull request #3233 from wking/fatal-requested-hook-directory-does-not-exist
libpod/container_internal: Make all errors loading explicitly configured hook dirs fatal
2019-07-29 16:39:08 +02:00
Jhon Honce
40bf0649af Fix commit --changes env=X=Y
Signed-off-by: Jhon Honce <jhonce@redhat.com>
2019-07-26 16:04:17 -07:00
OpenShift Merge Robot
0c4dfcfe57 Merge pull request #3639 from giuseppe/user-ns-container
podman: support --userns=ns|container
2019-07-26 15:06:06 +02:00
Giuseppe Scrivano
1d72f651e4 podman: support --userns=ns|container
Allow joining the user namespace of another container.

Closes: https://github.com/containers/libpod/issues/3629

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-07-25 23:04:55 +02:00
Sascha Grunert
7630f1b52e Fix possible runtime panic if image history len is zero
We now return an empty string for the `Comment` field if an OCI v1 image
contains no history.
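
The guard described above can be sketched roughly as follows. The `historyEntry` type and `imageComment` function are hypothetical stand-ins, and reading the first history entry is an assumption of this sketch rather than something the commit message states.

```go
package main

import "fmt"

// historyEntry is a hypothetical stand-in for one entry of an OCI v1 image
// history; only the field needed for the example is modelled.
type historyEntry struct {
	Comment string
}

// imageComment returns the comment of the first history entry, or an empty
// string when the image has no history. Indexing an empty slice without the
// length check would panic at runtime.
func imageComment(history []historyEntry) string {
	if len(history) == 0 {
		return ""
	}
	return history[0].Comment
}

func main() {
	fmt.Printf("%q\n", imageComment(nil))
	fmt.Printf("%q\n", imageComment([]historyEntry{{Comment: "base layer"}}))
}
```
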

Signed-off-by: Sascha Grunert <sgrunert@suse.com>
2019-07-25 12:45:08 +02:00
Matthew Heon
f747a06d53 When retrieving volumes, only use exact names
We should not be fuzzy matching on volume names. Docker doesn't
do it, and it doesn't make much sense. Everything requires exact
matches for names - only IDs allow partial matches.
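
The rule stated above (names match exactly, IDs may match by prefix) could look something like this. It is only an illustrative sketch: the `object` type, the `lookup` function and both error values are hypothetical and do not reflect libpod's actual lookup code.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// object is a hypothetical stand-in for anything with both an ID and a name.
type object struct {
	ID   string
	Name string
}

var (
	errNotFound  = errors.New("no such object")
	errAmbiguous = errors.New("ID prefix matches more than one object")
)

// lookup resolves query against objs. A name only matches when it is equal to
// the query; an ID matches when the query is a prefix of it, provided that
// prefix is unique.
func lookup(objs []object, query string) (object, error) {
	if query == "" {
		return object{}, errNotFound
	}
	for _, o := range objs {
		if o.Name == query {
			return o, nil
		}
	}
	var matches []object
	for _, o := range objs {
		if strings.HasPrefix(o.ID, query) {
			matches = append(matches, o)
		}
	}
	switch len(matches) {
	case 0:
		return object{}, errNotFound
	case 1:
		return matches[0], nil
	default:
		return object{}, errAmbiguous
	}
}

func main() {
	objs := []object{
		{ID: "a1b2", Name: "data"},
		{ID: "a9c3", Name: "database"},
	}
	for _, q := range []string{"data", "dat", "a9", "a"} {
		o, err := lookup(objs, q)
		fmt.Printf("%q -> %+v, err=%v\n", q, o, err)
	}
}
```

With fuzzy name matching, the query "dat" would have been ambiguous between the two names; under the exact-name rule it simply finds nothing.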

Fixes #3635

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
2019-07-24 22:30:16 -04:00
OpenShift Merge Robot
2283471f8d Merge pull request #3626 from mheon/fix_ps_segfault
Fix a segfault on Podman no-store commands with refresh
2019-07-24 14:45:01 +02:00
Peter Hunt
01a8483a59 refactor to reduce duplicated error parsing
Signed-off-by: Peter Hunt <pehunt@redhat.com>
2019-07-23 16:49:04 -04:00
Matthew Heon
5fb4feb36a Fix a segfault on Podman no-store commands with refresh
When a command (like `ps`) requests no store be created, but also
requires a refresh be performed, we have to ignore its request
and initialize the store anyways to prevent segfaults. This work
was done in #3532, but that missed one thing - initializing a
storage service. Without the storage service, Podman will still
segfault. Fix that oversight here.

Fixes #3625

Signed-off-by: Matthew Heon <mheon@redhat.com>
2019-07-23 13:30:30 -04:00
Peter Hunt
479eeac62c move editing of exitCode to runtime
There's no way to get the error if we successfully get an exit code (as it's just printed to stderr instead).
Instead of relying on the error being passed to podman and editing the exit code based on it, process it on the varlink side.

Also move error codes to define package

Signed-off-by: Peter Hunt <pehunt@redhat.com>
2019-07-23 13:29:33 -04:00
baude
a793bccae6 golangci-lint cleanup
a PR slipped through without running the new linter.  this cleans things
up for the master branch.

Signed-off-by: baude <bbaude@redhat.com>
2019-07-23 10:13:04 -05:00
OpenShift Merge Robot
26749204d5 Merge pull request #3621 from baude/golangcilint4
golangci-lint phase 4
2019-07-23 10:21:41 +02:00
baude
0c3038d4b5 golangci-lint phase 4
clean up some final linter issues and add a make target for
golangci-lint. in addition, begin running the tests as part of the
gating tasks in cirrus ci.

we cannot fully shift over to the new linter until we fix the image on
the openshift side.  for short term, we will use both

Signed-off-by: baude <bbaude@redhat.com>
2019-07-22 15:44:04 -05:00
Peter Hunt
a1a79c08b7 Implement conmon exec
This includes:
	Implement exec -i and fix some typos in description of -i docs
	pass failed runtime status to caller
	Add resize handling for a terminal connection
	Customize exec systemd-cgroup slice
	fix healthcheck
	fix top
	add --detach-keys
	Implement podman-remote exec (jhonce)
	* Cleanup some orphaned code (jhonce)
	adapt remote exec for conmon exec (pehunt)
	Fix healthcheck and exec to match docs
		Introduce two new OCIRuntime errors to more comprehensively describe situations in which the runtime can error
		Use these different errors in branching for exit code in healthcheck and exec
	Set conmon to use new api version

Signed-off-by: Jhon Honce <jhonce@redhat.com>

Signed-off-by: Peter Hunt <pehunt@redhat.com>
2019-07-22 15:57:23 -04:00
baude
db826d5d75 golangci-lint round #3
this is the third round of preparing to use the golangci-lint on our
code base.

Signed-off-by: baude <bbaude@redhat.com>
2019-07-21 14:22:39 -05:00
Daniel J Walsh
20302cb65d Cleanup Pull Message
Currently the pull message on failure is UGLY.  This patch removes a lot of the noise
when pulling an image from multiple registries to make the user experience better.

Our current messages are way too verbose and need to be dampened down.  Still has
verbose mode if you turn on log-level=debug.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2019-07-20 06:08:22 -04:00
Daniel J Walsh
8ae97b2f57 Add support for listing read/only and read/write images
When removing images with --all or pruning images, only attempt to remove read/write images
and ignore read/only images.

Signed-off-by: Daniel J Walsh <dwalsh@redhat.com>
2019-07-19 06:59:49 -04:00
OpenShift Merge Robot
deb087d7b1 Merge pull request #3443 from adrianreber/rootfs-changes-migration
Include changes to the container's root file-system in the checkpoint archive
2019-07-19 02:38:26 +02:00
OpenShift Merge Robot
22e62e8691 Merge pull request #3595 from mheon/fix_exec_leak
Remove exec PID files after use to prevent memory leaks
2019-07-18 15:52:57 +02:00
Matthew Heon
5bbede9d9f Remove exec PID files after use to prevent memory leaks
We have another patch running to do the same for exit files, with
a much more in-depth explanation of why it's necessary. Suffice
to say that persistent files in tmpfs tied to container CGroups
lead to significant memory allocations that last for the lifetime
of the file.

Based on a patch by Andrea Arcangeli (aarcange@redhat.com).

Signed-off-by: Matthew Heon <mheon@redhat.com>
2019-07-18 09:06:11 -04:00
Matthew Heon
c91bc31570 Populate inspect with security-opt settings
We can infer no-new-privileges. For now, manually populate
seccomp (can't infer what file we sourced from) and
SELinux/Apparmor (hard to tell if they're enabled or not).

Signed-off-by: Matthew Heon <mheon@redhat.com>
2019-07-17 16:48:38 -04:00
Matthew Heon
156b6ef222 Properly retrieve Conmon PID
Our previous method (just read the PID that we spawned) doesn't
work - Conmon double-forks to daemonize, so we end up with a PID
pointing to the first process, which dies almost immediately.

Reading from the PID file gets us the real PID.
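
Reading a PID back from a PID file can be sketched as below. The helper names `parsePID` and `readPIDFile` are hypothetical, and the demo writes its own PID to a temporary file instead of relying on a real Conmon PID file.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parsePID converts the textual content of a PID file into a PID. Surrounding
// whitespace (typically a trailing newline) is ignored.
func parsePID(content string) (int, error) {
	pid, err := strconv.Atoi(strings.TrimSpace(content))
	if err != nil {
		return 0, fmt.Errorf("invalid PID file content %q: %w", content, err)
	}
	if pid <= 0 {
		return 0, fmt.Errorf("invalid PID %d", pid)
	}
	return pid, nil
}

// readPIDFile reads the PID stored in the file at path.
func readPIDFile(path string) (int, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return parsePID(string(data))
}

func main() {
	// Write this process's own PID to a temporary file and read it back.
	f, err := os.CreateTemp("", "pidfile-demo")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	defer os.Remove(f.Name())
	fmt.Fprintf(f, "%d\n", os.Getpid())
	f.Close()

	pid, err := readPIDFile(f.Name())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}
	fmt.Println("PID file matches own PID:", pid == os.Getpid())
}
```
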

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
2019-07-17 16:48:38 -04:00
Matthew Heon
1e3e99f2fe Move the HostConfig portion of Inspect inside libpod
When we first began writing Podman, we ran into a major issue
when implementing Inspect. Libpod deliberately does not tie its
internal data structures to Docker, and stores most information
about containers encoded within the OCI spec. However, Podman
must present a CLI compatible with Docker, which means it must
expose all the information in 'docker inspect' - most of which is
not contained in the OCI spec or libpod's Config struct.

Our solution at the time was the create artifact. We JSON'd the
complete CreateConfig (a parsed form of the CLI arguments to
'podman run') and stored it with the container, restoring it when
we needed to run commands that required the extra info.

Over the past month, I've been looking more at Inspect, and
refactored large portions of it into Libpod - generating them
from what we know about the OCI config and libpod's (now much
expanded, versus previously) container configuration. This patch
comes close to completing the process, moving the last part of
inspect into libpod and removing the need for the create
artifact.

This improves libpod's compatibility with non-Podman containers.
We no longer require an arbitrarily-formatted JSON blob to be
present to run inspect.

Fixes: #3500

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
2019-07-17 16:48:38 -04:00
Stefan Becker
5ed2de158f healthcheck: reject empty commands
An image with "HEALTHCHECK CMD ['']" is valid but as there is no command
defined the healthcheck will fail. Reject such a configuration.

Fixes #3507

Signed-off-by: Stefan Becker <chemobejk@gmail.com>
2019-07-16 07:01:43 +03:00
Stefan Becker
dd0ea08cef healthcheck: improve command list parser
- remove duplicate check, already called in HealthCheck()
- reject zero-length command list and empty command string as erroneous
- support all Docker command list keywords: NONE, CMD or CMD-SHELL
- use Docker default "/bin/sh -c" for CMD-SHELL
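
The parsing rules listed above can be illustrated with a short sketch. `parseHealthcheck` and `errInvalid` are hypothetical names, and treating a list without a leading keyword as a plain argument vector is an assumption of this sketch, not something the commit message specifies.

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalid = errors.New("invalid healthcheck command")

// parseHealthcheck turns a Docker-style healthcheck command list into the
// argument vector to execute. disabled is true for the NONE keyword.
func parseHealthcheck(test []string) (argv []string, disabled bool, err error) {
	if len(test) == 0 {
		return nil, false, errInvalid
	}
	switch test[0] {
	case "NONE":
		return nil, true, nil
	case "CMD":
		// The remaining elements are executed directly.
		argv = test[1:]
	case "CMD-SHELL":
		// Exactly one command string, run through the default shell.
		if len(test) != 2 || test[1] == "" {
			return nil, false, errInvalid
		}
		return []string{"/bin/sh", "-c", test[1]}, false, nil
	default:
		// Assumption of this sketch: no keyword means a plain argv.
		argv = test
	}
	if len(argv) == 0 || argv[0] == "" {
		return nil, false, errInvalid
	}
	return argv, false, nil
}

func main() {
	inputs := [][]string{
		{"CMD-SHELL", "curl -f http://localhost/"},
		{"CMD", "/bin/true"},
		{"CMD", ""},
		{"NONE"},
		{},
	}
	for _, in := range inputs {
		argv, disabled, err := parseHealthcheck(in)
		fmt.Printf("%q -> argv=%q disabled=%v err=%v\n", in, argv, disabled, err)
	}
}
```
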

Fixes #3507

Signed-off-by: Stefan Becker <chemobejk@gmail.com>
2019-07-16 07:01:43 +03:00
OpenShift Merge Robot
547cb4e55e Merge pull request #3532 from mheon/ensure_store_on_refresh
Ensure we have a valid store when we refresh
2019-07-15 21:26:16 +02:00
dom finn
ee76ba5e68 Improves STD output/readability in combination
with debug output.

Added \n char to specific standard output

Signed-off-by: dom finn <dom.finn00@gmail.com>
2019-07-14 16:03:49 +10:00
OpenShift Merge Robot
20f11718de Merge pull request #3558 from mheon/fix_pod_remove
Fix a bug where ctrs could not be removed from pods
2019-07-11 21:35:53 +02:00
OpenShift Merge Robot
d614372c2f Merge pull request #3552 from baude/golangcilint2
golangci-lint pass number 2
2019-07-11 21:35:45 +02:00
Matthew Heon
8713483362 Fix a bug where ctrs could not be removed from pods
Using pod removal worked, but container removal was missing the
most critical step - the actual removal. Must have been
accidentally removed during a refactor.

Fixes #3556

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
2019-07-11 10:17:33 -04:00
baude
a78c885397 golangci-lint pass number 2
clean up and prepare to migrate to the golangci-linter

Signed-off-by: baude <bbaude@redhat.com>
2019-07-11 09:13:06 -05:00
Adrian Reber
05549e8b29 Add --ignore-rootfs option for checkpoint/restore
The newly added functionality to include the container's root
file-system changes into the checkpoint archive can now be explicitly
disabled. Either during checkpoint or during restore.

If a container changes a lot of files during its runtime it might be
more effective to migrate the root file-system changes in some other
way and to not needlessly increase the size of the checkpoint archive.

If a checkpoint archive does not contain the root file-system changes
information it will automatically be skipped. If the root file-system
changes are part of the checkpoint archive it is also possible to tell
Podman to ignore these changes.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-07-11 14:43:35 +02:00
Adrian Reber
1a32074884 Fix typo in checkpoint/restore related texts
Signed-off-by: Adrian Reber <areber@redhat.com>
2019-07-11 14:43:35 +02:00
Adrian Reber
217f2e77f8 Include root file-system changes in container migration
One of the last limitations when migrating a container using Podman's
'podman container checkpoint --export=/path/to/archive.tar.gz' was
that it was necessary to manually handle changes to the container's root
file-system. The recommendation was to mount everything as --tmpfs where
the root file-system was changed.

This extends the checkpoint export functionality to also include all
changes to the root file-system in the checkpoint archive. The
checkpoint archive now includes a tarstream of the result from 'podman
diff'. This tarstream will be applied to the restored container before
restoring the container.

With this any container can now be migrated, even if there are changes
to the root file-system.

There was some discussion before implementing this to base the root
file-system migration on 'podman commit', but it seemed wrong to do
a 'podman commit' before the migration as that would change the parent
layer the restored container is referencing. Probably not really a
problem, but it would have meant that a migrated container will always
reference another storage top layer than it used to reference during
initial creation.

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-07-11 14:43:34 +02:00
Adrian Reber
d5f1caaf50 Add function to get a filtered tarstream diff
The newly added function GetDiffTarStream() mirrors the GetDiff()
function. It tries to get the correct layer ID from getLayerID()
and it filters out containerMounts from the tarstream. Thus the
behavior is the same as GetDiff(), but it returns a tarstream.

This also adds the function ApplyDiffTarStream() to apply the tarstream
generated by GetDiffTarStream().

These functions are targeted to support container migration with
root file-system changes.
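
The idea of filtering mount paths out of a tar stream can be sketched with the standard library alone. This is not the GetDiffTarStream() implementation; `filterTar`, `isExcluded` and `demo` are hypothetical helpers, and the excluded paths in the demo are made-up examples.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// isExcluded reports whether name is one of the excluded paths or lies
// beneath one of them. Leading and trailing slashes are ignored.
func isExcluded(name string, excluded []string) bool {
	name = strings.Trim(name, "/")
	for _, e := range excluded {
		e = strings.Trim(e, "/")
		if name == e || strings.HasPrefix(name, e+"/") {
			return true
		}
	}
	return false
}

// filterTar copies every entry from src to dst except the excluded ones.
func filterTar(dst io.Writer, src io.Reader, excluded []string) error {
	tr := tar.NewReader(src)
	tw := tar.NewWriter(dst)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break
		}
		if err != nil {
			return err
		}
		if isExcluded(hdr.Name, excluded) {
			continue // the next call to Next skips the entry's data
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		if _, err := io.Copy(tw, tr); err != nil {
			return err
		}
	}
	return tw.Close()
}

// demo builds a small in-memory archive, filters it, and returns the names
// of the entries that remain.
func demo() ([]string, error) {
	var in bytes.Buffer
	tw := tar.NewWriter(&in)
	body := []byte("data")
	for _, name := range []string{"etc/app.conf", "run/secrets/token", "dev/shm/x", "runner/log"} {
		hdr := &tar.Header{Name: name, Mode: 0o644, Size: int64(len(body)), Typeflag: tar.TypeReg}
		if err := tw.WriteHeader(hdr); err != nil {
			return nil, err
		}
		if _, err := tw.Write(body); err != nil {
			return nil, err
		}
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}

	var out bytes.Buffer
	if err := filterTar(&out, &in, []string{"/run", "/dev"}); err != nil {
		return nil, err
	}

	var names []string
	tr := tar.NewReader(&out)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return names, nil
		}
		if err != nil {
			return nil, err
		}
		names = append(names, hdr.Name)
	}
}

func main() {
	names, err := demo()
	fmt.Println(names, err)
}
```

Matching on whole path components keeps "runner/log" in the archive even though its name starts with the same characters as the excluded "/run".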

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-07-11 14:43:34 +02:00
OpenShift Merge Robot
144567b42d Merge pull request #3527 from adrianreber/finish
Correctly set FinishedTime for checkpointed container
2019-07-11 10:23:19 +02:00
Adrian Reber
f187bab497 Correctly set FinishedTime for checkpointed container
During 'podman container checkpoint' the finished time was not set. This
resulted in a strange container status after checkpointing:

 Exited (0) 292 years ago

During checkpointing FinishedTime is now set to time.Now().
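
The "292 years ago" figure is consistent with how Go handles an unset time.Time: subtracting the zero value from the current time overflows a time.Duration, so Sub returns the maximum duration, which is roughly 292 years. The sketch below demonstrates that arithmetic with hypothetical helper names; it does not reproduce Podman's status formatting.

```go
package main

import (
	"fmt"
	"time"
)

// wholeYears converts a duration to whole 365-day years.
func wholeYears(d time.Duration) int {
	return int(d.Hours() / 24 / 365)
}

// yearsSinceUnset returns how many years appear to have elapsed at now when
// the finished time was never set. Time.Sub saturates at the maximum
// Duration instead of overflowing.
func yearsSinceUnset(now time.Time) int {
	var finished time.Time // zero value: January 1, year 1, 00:00:00 UTC
	return wholeYears(now.Sub(finished))
}

// demoYears evaluates the calculation at a fixed date so the result does not
// depend on when the example is run.
func demoYears() int {
	return yearsSinceUnset(time.Date(2019, time.July, 11, 0, 0, 0, 0, time.UTC))
}

func main() {
	fmt.Println("apparent age in years:", demoYears())
}
```
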

Signed-off-by: Adrian Reber <areber@redhat.com>
2019-07-11 07:35:38 +02:00
OpenShift Merge Robot
e2e8477f83 Merge pull request #3521 from baude/golangcilint1
first pass of corrections for golangci-lint
2019-07-11 01:22:30 +02:00
baude
e053e0e05e first pass of corrections for golangci-lint
Signed-off-by: baude <bbaude@redhat.com>
2019-07-10 15:52:17 -05:00
Giuseppe Scrivano
18c4d73867 runtime: drop spurious message log
fix a regression introduced by 1d36501f96

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-07-10 15:47:38 +02:00
Matthew Heon
5ef972d87b Ensure we have a valid store when we refresh
Fixes #3520

Signed-off-by: Matthew Heon <matthew.heon@pm.me>
2019-07-10 08:55:48 -04:00
OpenShift Merge Robot
76aa8f6d2d Merge pull request #3529 from giuseppe/healthcheck-rootless
healthcheck: support rootless mode
2019-07-09 16:09:37 +02:00
Giuseppe Scrivano
c6c637da00 healthcheck: support rootless mode
now that dbus authentication works fine from a user namespace (systemd
241 works fine), we can enable rootless healthchecks.

It uses "systemd-run --user" for creating the healthcheck timer and
communicates with the user instance of systemd listening at
$XDG_RUNTIME_DIR/systemd/private.

Closes: https://github.com/containers/libpod/issues/3523

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
2019-07-09 14:20:20 +02:00
OpenShift Merge Robot
fce2e6577e Merge pull request #3497 from QazerLab/bugfix/systemd-generate-pidfile
Use conmon pidfile in generated systemd unit as PIDFile.
2019-07-08 23:39:42 +02:00