Heap dump
From Resin 3.0
If an application gobbles up memory until it throws an OutOfMemory
exception, or seems to be spending an excessive amount of time doing garbage
collection, a heap dump can help track down the source of the problem.
What you really need to do is get a good CPU and heap profile. The JDK has a simple one (but not very user-friendly), so you don't absolutely need to buy a profiler.
It's a standard java argument, so "java -Xrunhprof:help" will tell you how to start it. For example, you could start Resin as
unix$ bin/httpd.sh -J-Xrunhprof:heap=sites,cpu=samples win> httpd.exe -J-Xrunhprof:heap=sites,cpu=samples
Run a load against it for a decent amount of time (you can even use something like Apache's "ab" for 10 minutes), and then stop the server nicely. i.e. you can't just kill it with Ctrl-C, but you need to make it do a clean exit.
That will dump a java.hprof.txt. Skip to the bottom to get the trace.
Understanding java.hprof.txt heap
Assuming you're being cheap, and using the JDK's heap profiling instead of buying a profiler, you'll need a bit of help interpreting it.
The following is an example of a heap dump running Resin. Here's what you might see skipping down to the "SITES BEGIN" section. (I did mention this is the cheap way of doing things.) As with most profiling, you only want to pay attention to the top 20. Everything else is in the noise. Ignore it.
SITES BEGIN (ordered by live bytes) Tue Jan 9 17:44:33 2001 percent live alloc'ed stack class rank self accum bytes objs bytes objs trace name 1 11.87% 11.87% 983520 120 1393320 170 2553 [B 2 9.89% 21.76% 819600 100 1286772 157 4070 [B 3 9.09% 30.85% 753756 23 3539376 108 4970 [L<Unknown>; 4 5.83% 36.68% 483564 59 778620 95 7180 [B 5 5.74% 42.42% 475368 58 745836 91 7178 [B 6 4.35% 46.77% 360624 44 696660 85 7182 [B 7 2.97% 49.74% 245880 30 450780 55 7176 [B 8 2.37% 52.11% 196704 24 352428 43 7254 [B 9 1.88% 53.99% 155724 19 262272 32 7174 [B 10 1.78% 55.77% 147528 18 245880 30 7137 [B 11 1.53% 57.30% 126988 1063 16973092 129113 3271 [C 12 1.34% 58.64% 110684 3953 20362832 727244 1213 sun/io/CharToByteISO8859_1 13 1.25% 59.88% 103320 738 141820 1013 5942 java/lang/Class 14 1.21% 61.10% 100548 49 221616 108 5003 [L<Unknown>; 15 1.21% 62.31% 100548 49 221616 108 5005 [L<Unknown>; 16 1.07% 63.38% 89080 1532 18393580 317347 1340 [B 17 0.79% 64.18% 65568 8 81960 10 8408 [B 18 0.79% 64.97% 65552 4 65552 4 27630 [C 19 0.70% 65.67% 58232 24 1110128 386 5038 [C 20 0.68% 66.35% 56200 450 116816 980 7186 [C
There are two things to look for. First, if any one class starts chewing up a large number in the "live objs" column, you'll need to take a look. That might be a memory leak.
Second, if some class has a huge number in the "alloc'ed objs" column, you may be wasting lots of garbage collection time that could be easily solved with some caching.
The [C in the class name means a character array. To see what that really means, you'll need to look at the stack trace (3271):
TRACE 3271: java/lang/String.<init>(String.java:244) com/caucho/util/CharBuffer.close(CharBuffer.java:714) com/caucho/vfs/FilesystemPath.normalizePath(FilesystemPath.java:162) com/caucho/vfs/FilesystemPath.schemeWalk(FilesystemPath.java:127)
That's part of Resin's VFS code. Maybe in the future it'll make sense to try to reduce that.
You can get a longer trace using the "depth=" argument:
unix$ bin/httpd.sh -J-Xrunhprof:heap=sites,cpu=samples,depth=10 win> httpd.exe -J-Xrunhprof:heap=sites,cpu=samples,depth=10
Understanding java.hprof.txt CPU
The CPU profiling is a little easier to read. On some JDK's you'll need to run it with the JIT disabled.
CPU SAMPLES BEGIN (total = 424614) Tue Jan 9 17:44:33 2001 rank self accum count trace method 1 21.36% 21.36% 90704 7266 com/caucho/server/http/VirtualHost.logAccess 2 10.84% 32.20% 46041 7269 java/net/SocketInputStream.socketRead 3 5.99% 38.19% 25428 1213 java/lang/Class.newInstance0 4 5.11% 43.31% 21715 7896 com/caucho/util/CharBuffer.toString 5 4.82% 48.13% 20463 1286 sun/io/CharToByteISO8859_1.convert 6 3.54% 51.66% 15018 1242 sun/io/CharToByteConverter.<init> 7 2.68% 54.35% 11388 7241 java/io/PrintWriter.<init> 8 2.47% 56.82% 10508 7748 com/caucho/server/http/Request.fillCookies 9 2.27% 59.09% 9650 1214 sun/io/ByteToCharConverter.<init> 10 1.85% 60.94% 7857 5097 java/lang/String.<init> 11 1.59% 62.53% 6754 1341 java/lang/String.substring 12 1.57% 64.10% 6650 1340 java/lang/String.getBytes 13 0.92% 65.02% 3907 7897 java/lang/String.<init> 14 0.76% 65.78% 3227 3259 com/caucho/vfs/FilePath.fsWalk 15 0.75% 66.53% 3195 7895 com/caucho/server/http/Request.fillCookie 16 0.71% 67.25% 3031 7321 java/lang/String.getBytes 17 0.71% 67.95% 2996 3270 com/caucho/util/CharBuffer.close 18 0.68% 68.63% 2892 3271 java/lang/String.<init> 19 0.66% 69.29% 2782 7318 com/caucho/vfs/FilePath.openWriteImpl 20 0.61% 69.90% 2604 7320 java/io/FileOutputStream.<init>
You should only pay attention to the top 20 or so. You'll probably need to ignore a few of the top 10, because they're just waiting for a user response. The SocketInputStream.socketRead is an example.
You can use the trace number to get a call trace:
TRACE 7266: com/caucho/server/http/VirtualHost.logAccess(VirtualHost.java:487) com/caucho/server/http/Application.logAccess(Application.java:1846) com/caucho/server/http/Response.finish(Response.java:1345) com/caucho/server/http/Request.finish(Request.java:416)