From 0e9224f67114f8a61316f2752b4158928faeb39c Mon Sep 17 00:00:00 2001 From: Richard Littauer Date: Fri, 29 Apr 2016 14:05:07 -0400 Subject: [PATCH] Added a debug-guidelines doc Stolen from this comment https://github.com/ipfs/go-ipfs/issues/2594#issuecomment-212506980 License: MIT Signed-off-by: Richard Littauer --- debug-guide.md | 91 ++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 91 insertions(+) create mode 100644 debug-guide.md diff --git a/debug-guide.md b/debug-guide.md new file mode 100644 index 000000000..239ba1e0a --- /dev/null +++ b/debug-guide.md @@ -0,0 +1,91 @@ +# General performance debugging guidelines + +This is a document for helping debug go-ipfs. Please add to it if you can! + +### Table of Contents +- [Beginning](#beginning) +- [Analysing the stack dump](#analysing-the-stack-dump) +- [Analyzing the CPU Profile](#analyzing-the-cpu-profile) +- [Other](#other) + +### Beginning + +When you see ipfs doing something (using lots of CPU, memory, or otherwise +being weird), the first thing you want to do is gather all the relevant +profiling information. + +- goroutine dump + - `curl localhost:5001/debug/pprof/goroutine\?debug=2 > ipfs.stacks` +- 30 second cpu profile + - `curl localhost:5001/debug/pprof/profile > ipfs.cpuprof` +- heap trace dump + - `curl localhost:5001/debug/pprof/heap > ipfs.stacks` +- system information + - `ipfs diag sys > ipfs.sysinfo` + +Bundle all that up and include a copy of the ipfs binary that you are running +(having the exact same binary is important, it contains debug info). + +You can investigate yourself if you feel intrepid: + +### Analysing the stack dump + +The first thing to look for is hung goroutines -- any goroutine thats been stuck +for over a minute will note that in the trace. It looks something like: + +``` +goroutine 2306090 [semacquire, 458 minutes]: +sync.runtime_Semacquire(0xc8222fd3e4) + /home/whyrusleeping/go/src/runtime/sema.go:47 +0x26 +sync.(*Mutex).Lock(0xc8222fd3e0) + /home/whyrusleeping/go/src/sync/mutex.go:83 +0x1c4 +gx/ipfs/QmedFDs1WHcv3bcknfo64dw4mT1112yptW1H65Y2Wc7KTV/yamux.(*Session).Close(0xc8222fd340, 0x0, 0x0) + /home/whyrusleeping/gopkg/src/gx/ipfs/QmedFDs1WHcv3bcknfo64dw4mT1112yptW1H65Y2Wc7KTV/yamux/session.go:205 +0x55 +gx/ipfs/QmWSJzRkCMJFHYUQZxKwPX8WA7XipaPtfiwMPARP51ymfn/go-stream-muxer/yamux.(*conn).Close(0xc8222fd340, 0x0, 0x0) + /home/whyrusleeping/gopkg/src/gx/ipfs/QmWSJzRkCMJFHYUQZxKwPX8WA7XipaPtfiwMPARP51ymfn/go-stream-muxer/yamux/yamux.go:39 +0x2d +gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream.(*Conn).Close(0xc8257a2000, 0x0, 0x0) + /home/whyrusleeping/gopkg/src/gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream/conn.go:156 +0x1f2 +created by gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream.(*Conn).GoClose + /home/whyrusleeping/gopkg/src/gx/ipfs/QmZK81vcgMhpb2t7GNbozk7qzt6Rj4zFqitpvsWT9mduW8/go-peerstream/conn.go:131 +0xab +``` + +At the top, you can see that this goroutine (number 2306090) has been waiting +to acquire a semaphore for 458 minutes. That seems bad. Looking at the rest of +the trace, we see the exact line it's waiting on is line 47 of runtime/sema.go. +That's not particularly helpful, so we move on. Next, we see that call was made +by line 205 of yamux/session.go in the `Close` method of `yamux.Session`. This +one appears to be the issue. + +Given that information, look for another goroutine that might be +holding the semaphore in question in the rest of the stack dump. +(If you need help doing this, ping and we'll stub this out.) + +There are a few different reasons that goroutines can be hung: +- `semacquire` means we're waiting to take a lock or semaphore. +- `select` means that the goroutine is hanging in a select statement and none of the cases are yielding +anything. +- `chan receive` and `chan send` are waiting for a channel to be received from or sent on, respectively. +- `IO wait` generally means that we are waiting on a socket to read or write data, although it *can* mean we are +waiting on a very slow filesystem. + +If you see any of those tags _without_ a `, +X minutes` suffix, that generally means there isn't a problem -- you just caught +that goroutine in the middle of a short wait for something. If the wait time is +over a few minutes, that either means that goroutine doesn't do much, or +something is pretty wrong. + +### Analyzing the CPU Profile + +The go team wrote an [excellent article on profiling go +programs](http://blog.golang.org/profiling-go-programs). If you've already +gathered the above information, you can skip down to where they start talking +about `go tool pprof`. My go-to method of analyzing these is to run the `web` +command, which generates an SVG dotgraph and opens it in your browser. This is +the quickest way to easily point out where the hot spots in the code are. + +### Other + +If you have any questions, or want us to analyze some weird go-ipfs behaviour, +just let us know, and be sure to include all the profiling information +mentioned at the top. +