Performance Tuning
An exhaustive list of various techniques you might want to use to get the most performance possible out of your mod_perl server: configuration, coding, memory use and more.
To make the user's Web browsing experience as painless as possible, every effort must be made to wring the last drop of performance from the server. There are many factors which affect Web site usability, but speed is one of the most important. This applies to any webserver, not just Apache, so it is very important that you understand it.
How do we measure the speed of a server? Since the user (and not the computer) is the one that interacts with the Web site, one good speed measurement is the time elapsed between the moment when she clicks on a link or presses a Submit button to the moment when the resulting page is fully rendered.
The requests and replies are broken into packets. A request may be made up of several packets, a reply may be many thousands. Each packet has to make its own way from one machine to another, perhaps passing through many interconnection nodes. We must measure the time starting from when the first packet of the request leaves our user's machine to when the last packet of the reply arrives back there.
A webserver is only one of the entities the packets see along their way. If we follow them from browser to server and back again, they may travel by different routes through many different entities. Before your server processes them, the packets might have to pass through proxy (accelerator) servers. If the request contains more than one packet, the packets might arrive at the server by different routes and at different times, so packets that arrive early may have to wait for the others before they can be reassembled into a chunk of the request message that the server then reads. Then the whole process is repeated in reverse.
You could work hard to fine tune your webserver's performance, but a slow Network Interface Card (NIC) or a slow network connection from your server might defeat it all. That's why it's important to think about the Big Picture and to be aware of possible bottlenecks between the server and the Web.
Of course there is little that you can do if the user has a slow connection. You might tune your scripts and webserver to process incoming requests ultra quickly, so you will need only a small number of working servers, but you might find that the server processes are all busy waiting for slow clients to accept their responses.
But there are techniques to cope with this. For example, you can compress the response before delivering it. If you are delivering a plain text response, gzip compression will sometimes reduce its size by a factor of 10.
You should analyze all the components involved when you try to create the best service for your users, and not only the web server or the code that the web server executes. A Web service is like a car: if one of the parts or mechanisms is broken the car may not run smoothly, and it can even stop dead if pushed too far without first fixing it.
And let me stress it again: if you want to succeed in the web service business you should start worrying about the client's browsing experience and not only about how well your code benchmarks.
Before we try to solve a problem we need to identify it. In our case we want to get the best performance we can with as little monetary and time investment as possible.
(META: Only partial analysis. Please submit more points. Many points are scattered around the document and should be gathered here, to represent the whole picture. It also should be merged with the above item!)
You need to analyze all of the problem's dimensions. There are several things that need to be considered:
How long does it take to process each request?
How many requests can you process simultaneously?
How many simultaneous requests are you planning to get?
At what rate are you expecting to receive requests?
The first one is probably the easiest to optimize. Following the performance optimization tips in this and other documents allows a perl (mod_perl) programmer to exercise their code and improve it.
The second one is a function of RAM. How much RAM is in each box, how many boxes do you have, and how much RAM does each mod_perl process use? Multiply the first two and divide by the third. Ask yourself whether it is better to switch to another, possibly just as inefficient language or whether that will actually cost more than throwing another powerful machine into the rack.
Also ask yourself whether switching to another language will even help. In some applications, for example to link Oracle runtime libraries, a huge chunk of memory is needed so you would save nothing even if you switched from Perl to C.
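The estimate described above (RAM per box times the number of boxes, divided by the size of one mod_perl process) can be sketched in a few lines of Perl. The figures are made up for illustration, and the calculation is deliberately rough: it ignores memory shared between processes and the memory needed by the OS and other services.

```perl
use strict;
use warnings;

# Rough upper bound on the number of simultaneous mod_perl processes.
# All sizes are in KB; the numbers passed below are hypothetical.
sub max_processes {
    my ($ram_per_box, $boxes, $process_size) = @_;
    die "process size must be positive\n" unless $process_size > 0;
    return int($ram_per_box * $boxes / $process_size);
}

# Two boxes with 512MB each, 10MB per mod_perl process.
my $limit = max_processes(512 * 1024, 2, 10 * 1024);
print "roughly $limit processes can run at once\n";
```

Treat the result as a ceiling, not a target: the real limit depends on how much of each process is actually shared.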
The last two are important. You need a realistic estimate. Are you really expecting 8 million hits per day? What is the expected peak load, and what kind of response time do you need to guarantee? Remember that these numbers might change drastically when you apply code changes and your site becomes popular. Remember that when you get a very high hit rate, the resource requirements don't grow linearly but exponentially!
More coverage is provided in the section "Choosing Hardware".
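To make an estimate such as "8 million hits per day" more tangible, it helps to convert it into requests per second. The following is a minimal sketch; the peak factor of 5 is purely an assumption for the sake of the example and should be replaced with a figure derived from your own traffic logs.

```perl
use strict;
use warnings;

# Convert a daily hit count into an average and an estimated peak rate.
# The peak factor is an assumption; measure your own traffic to choose it.
sub request_rates {
    my ($hits_per_day, $peak_factor) = @_;
    my $average = $hits_per_day / (24 * 60 * 60);
    return ($average, $average * $peak_factor);
}

my ($avg, $peak) = request_rates(8_000_000, 5);
printf "average: %.1f req/sec, assumed peak: %.1f req/sec\n", $avg, $peak;
```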
In order to improve performance we need measurement tools. The main tool categories are benchmarking and code profiling.
It's important to understand that in most of the benchmarking tests we will execute, we will not look at the absolute result numbers but at the relation between two or more result sets, since in most cases we are trying to show which coding approach is preferable. You shouldn't compare the absolute results in this Guide with those collected while running the same benchmarks on your machine, since you won't have the exact same hardware and software setup anyway, and such a comparison would be misleading. Compare the relative results from the tests running on your machine; don't compare your absolute results with those in this Guide.
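One way to look at relative results only is the cmpthese() function of the Benchmark module, which prints a comparison table of rates and percentage differences. The two subroutines below are hypothetical stand-ins for the coding approaches you want to compare, and the iteration count is arbitrary.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Two hypothetical ways to build the same string; only their
# relative speed is of interest.
sub join_concat  { my $s = ''; $s .= $_ for 1 .. 100; return $s }
sub join_builtin { return join '', 1 .. 100 }

# 'none' suppresses the per-test absolute timings.
my $results = timethese(50_000, {
    concat => \&join_concat,
    join   => \&join_builtin,
}, 'none');

# Print a table of rates and the percentage difference between them.
cmpthese($results);
```

The absolute rates in the table will differ from machine to machine; the percentage column is the part worth comparing.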
How much faster is mod_perl than mod_cgi (aka plain perl/CGI)? There
are many ways to benchmark the two. I'll present a few examples and
numbers below. Check out the benchmark directory of the mod_perl
distribution for more examples.
If you are going to write your own benchmarking utility, use the
Benchmark module for heavy scripts and the Time::HiRes module
for very fast scripts (faster than 1 sec) where you will need better
time precision.
There is no need to write a special benchmark though. If you want to
impress your boss or colleagues, just take some heavy CGI script you
have (e.g. a script that crunches some data and prints the results to
STDOUT), open 2 xterms and call the same script in mod_perl mode in
one xterm and in mod_cgi mode in the other. You can use lwp-get
from the LWP package to emulate the browser. The benchmark
directory of the mod_perl distribution includes such an example.
See also two tools for benchmarking: ApacheBench and crashme test
If you are going to write your own benchmarking utility, use the
Benchmark module and the Time::HiRes module where you need
better time precision (<10msec).
An example of the Benchmark.pm module usage:
benchmark.pl
------------
use Benchmark;
timethis (1_000,
          sub {
              my $x = 100;
              my $y = log ($x ** 100) for (0..10000);
          });
% perl benchmark.pl
timethis 1000: 25 wallclock secs (24.93 usr + 0.00 sys = 24.93 CPU)
If you want to get the benchmark results in microseconds you will
have to use the Time::HiRes module; its usage is similar to
Benchmark's.
use Time::HiRes qw(gettimeofday tv_interval);
my $start_time = [ gettimeofday ];
sub_that_takes_a_teeny_bit_of_time();
my $end_time = [ gettimeofday ];
my $elapsed = tv_interval($start_time,$end_time);
print "The sub took $elapsed seconds.";
See also the crashme test.
Here are the numbers from Michael Parker's mod_perl presentation at the Perl Conference (Aug, 98). (Sorry, there used to be links here to the source, but they went dead one day, so I removed them.) The script is a standard hits counter, but it logs the counts into a MySQL relational database:
Benchmark: timing 100 iterations of cgi, perl... [rate 1:28]
cgi: 56 secs ( 0.33 usr 0.28 sys = 0.61 cpu)
perl: 2 secs ( 0.31 usr 0.27 sys = 0.58 cpu)
Benchmark: timing 1000 iterations of cgi,perl... [rate 1:21]
cgi: 567 secs ( 3.27 usr 2.83 sys = 6.10 cpu)
perl: 26 secs ( 3.11 usr 2.53 sys = 5.64 cpu)
Benchmark: timing 10000 iterations of cgi, perl [rate 1:21]
cgi: 6494 secs (34.87 usr 26.68 sys = 61.55 cpu)
perl: 299 secs (32.51 usr 23.98 sys = 56.49 cpu)
We don't know what server configurations were used for these tests, but I guess the numbers speak for themselves.
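The rates quoted in the results above can be reproduced by dividing the wallclock seconds of the mod_cgi run by those of the mod_perl run. Here is a small sketch of that arithmetic, using only the numbers from the listing. Note that the CPU times are nearly identical in both modes; it is the wallclock time that differs.

```perl
use strict;
use warnings;

# Wallclock seconds copied from the results above:
# [iterations, mod_cgi seconds, mod_perl seconds].
my @runs = ([100, 56, 2], [1000, 567, 26], [10000, 6494, 299]);

sub speedup {
    my ($cgi_secs, $perl_secs) = @_;
    return $cgi_secs / $perl_secs;
}

for my $run (@runs) {
    my ($iterations, $cgi, $perl) = @$run;
    printf "%5d iterations: mod_perl is %.1f times faster\n",
        $iterations, speedup($cgi, $perl);
}
```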
The source code of the script was available at http://www.realtime.net/~parkerm/perl/conf98/sld006.htm. It's now a dead link. If you know its new location, please let me know.
In the next sections we will talk about tools that allow us to benchmark response times.
ApacheBench (ab) is a tool for benchmarking your Apache HTTP
server. It is designed to give you an idea of the performance that
your current Apache installation can give. In particular, it shows
you how many requests per second your Apache server is capable of
serving. The ab tool comes bundled with the Apache source
distribution.
Let's try it. We will simulate 10 users concurrently requesting a
very light script at www.example.com/perl/test.pl. Each simulated
user makes 10 requests.
% ./ab -n 100 -c 10 www.example.com/perl/test.pl
The results are:
Document Path: /perl/test.pl
Document Length: 319 bytes
Concurrency Level: 10
Time taken for tests: 0.715 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 60700 bytes
HTML transferred: 31900 bytes
Requests per second: 139.86
Transfer rate: 84.90 kb/s received
Connection Times (ms)
min avg max
Connect: 0 0 3
Processing: 13 67 71
Total: 13 67 74
We can see that under a load of ten concurrent users our server is
capable of processing 140 requests per second. Of course this
figure holds only for the script under test. We can
also learn about the average processing time, which in this case was
67 milliseconds. Other numbers reported by ab may or may not be of
interest to you.
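The headline figure in the report can be recomputed from two other lines of the same report: the number of completed requests and the time taken. The sketch below also derives the mean time per request at the given concurrency level, which lands in the same ballpark as the average reported under Connection Times. This is only an illustration of the arithmetic, not a description of how ab computes its numbers internally.

```perl
use strict;
use warnings;

# Figures copied from the ab report above.
my %report = (requests => 100, seconds => 0.715, concurrency => 10);

sub requests_per_second {
    my ($requests, $seconds) = @_;
    return $requests / $seconds;
}

# Mean time one simulated user waits for a request, in milliseconds.
sub ms_per_request {
    my ($concurrency, $rps) = @_;
    return 1000 * $concurrency / $rps;
}

my $rps = requests_per_second(@report{qw(requests seconds)});
printf "%.2f requests/sec, %.1f ms per request at concurrency %d\n",
    $rps, ms_per_request($report{concurrency}, $rps), $report{concurrency};
```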
For example, if we believe that the script perl/test.pl is not efficient we can try to improve it and run the benchmark again, to see whether there is any improvement in performance.
HTTPD::Bench::ApacheBench, available from CPAN, provides a Perl
interface for ab.
httperf is a utility written by David Mosberger. Just like ApacheBench, it measures the performance of the webserver.
A sample command line is shown below:
httperf --server hostname --port 80 --uri /test.html \
        --rate 150 --num-conn 27000 --num-call 1 --timeout 5
This command causes httperf to use the web server on the host with IP name hostname, running at port 80. The web page being retrieved is /test.html and, in this simple test, the same page is retrieved repeatedly. The rate at which requests are issued is 150 per second. The test involves initiating a total of 27,000 TCP connections and on each connection one HTTP call is performed. A call consists of sending a request and receiving a reply.
The timeout option defines the number of seconds that the client is willing to wait to hear back from the server. If this timeout expires, the tool considers the corresponding call to have failed. Note that with a total of 27,000 connections and a rate of 150 per second, the total test duration will be approximately 180 seconds (27,000/150), independently of what load the server can actually sustain. Here is a result that one might get:
Total: connections 27000 requests 26701 replies 26701 test-duration 179.996 s
Connection rate: 150.0 conn/s (6.7 ms/conn, <=47 concurrent connections)
Connection time [ms]: min 1.1 avg 5.0 max 315.0 median 2.5 stddev 13.0
Connection time [ms]: connect 0.3
Request rate: 148.3 req/s (6.7 ms/req)
Request size [B]: 72.0
Reply rate [replies/s]: min 139.8 avg 148.3 max 150.3 stddev 2.7 (36 samples)
Reply time [ms]: response 4.6 transfer 0.0
Reply size [B]: header 222.0 content 1024.0 footer 0.0 (total 1246.0)
Reply status: 1xx=0 2xx=26701 3xx=0 4xx=0 5xx=0
CPU time [s]: user 55.31 system 124.41 (user 30.7% system 69.1% total 99.8%)
Net I/O: 190.9 KB/s (1.6*10^6 bps)
Errors: total 299 client-timo 299 socket-timo 0 connrefused 0 connreset 0
Errors: fd-unavail 0 addrunavail 0 ftab-full 0 other 0
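Two figures in this sample can be cross-checked with simple arithmetic: the expected test duration (connections divided by rate) and the share of calls that timed out (errors divided by connections). A minimal sketch, using only the numbers shown above:

```perl
use strict;
use warnings;

# Expected duration of a run that opens connections at a fixed rate.
sub test_duration {
    my ($connections, $rate) = @_;
    return $connections / $rate;
}

# Share of calls that failed, as a percentage.
sub failure_percent {
    my ($errors, $connections) = @_;
    return 100 * $errors / $connections;
}

printf "expected duration: %d seconds\n", test_duration(27_000, 150);
printf "failed calls: %.2f%%\n", failure_percent(299, 27_000);
```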
http_load is yet another utility that does webserver load
testing. It can simulate a 33.6kbps modem connection (-throttle) and
allows you to provide a file with a list of URLs, which will be fetched
randomly. You can specify how many parallel connections to run using
the -parallel N option, or you can specify the number of requests
to generate per second with the -rate N option. Finally you can tell
the utility when to stop by specifying either the test time length
(-seconds N) or the total number of fetches (-fetches N).
A sample run with the file urls including:
http://www.example.com/foo/
http://www.example.com/bar/
We ask to generate three requests per second and run for only two seconds. Here is the generated output:
% ./http_load -rate 3 -seconds 2 urls
http://www.example.com/foo/: check-connect SUCCEEDED, ignoring
http://www.example.com/bar/: check-connect SUCCEEDED, ignoring
http://www.example.com/bar/: check-connect SUCCEEDED, ignoring
http://www.example.com/bar/: check-connect SUCCEEDED, ignoring
http://www.example.com/foo/: check-connect SUCCEEDED, ignoring
5 fetches, 3 max parallel, 96870 bytes, in 2.00258 seconds
19374 mean bytes/connection
2.49678 fetches/sec, 48372.7 bytes/sec
msecs/connect: 1.805 mean, 5.24 max, 0.79 min
msecs/first-response: 291.289 mean, 560.338 max, 34.349 min
So you can see that it has reported 2.5 requests per second. Of course for the real test you will want to load the server heavily and run the test for a longer time to get more reliable results.
Note that when you provide a file with a list of URLs, make sure that it doesn't contain empty lines. If it does, the utility won't work and will complain:
./http_load: unknown protocol -
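If the URL list is generated or edited by hand, it may be easier to filter it than to check it by eye. The helper below is a hypothetical sketch (clean_url_list is not part of http_load) that drops empty and whitespace-only lines; its output could be written to the file that is then passed to http_load.

```perl
use strict;
use warnings;

# Return only the lines that contain something other than whitespace,
# with the line endings removed.
sub clean_url_list {
    my @lines = @_;
    my @urls;
    for my $line (@lines) {
        $line =~ s/\r?\n\z//;          # strip the line ending
        next unless $line =~ /\S/;     # skip empty and blank lines
        push @urls, $line;
    }
    return @urls;
}

my @raw = (
    "http://www.example.com/foo/\n",
    "\n",
    "http://www.example.com/bar/\n",
    "   \n",
);
print "$_\n" for clean_url_list(@raw);
```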
This is another crashme suite, originally written by Michael Schilli (it was located at the http://www.linux-magazin.de site, but the link is now gone). I made a few modifications, mostly adding my() operators. I also allowed it to accept more than one URL to test, since sometimes you want to test more than one script.
The tool provides the same results as ab above but it also allows you to set the timeout value, so requests will fail if not served within the time out period. You also get values for Latency (seconds per request) and Throughput (requests per second). It can do a complete simulation of your favorite Netscape browser :) and give you a better picture.
I have noticed while running these two benchmarking suites, that ab gave me results from two and a half to three times better. Both suites were run on the same machine, with the same load and the same parameters, but the implementations were different.
Sample output:
URL(s): http://www.example.com/perl/access/access.cgi
Total Requests: 100
Parallel Agents: 10
Succeeded: 100 (100.00%)
Errors: NONE
Total Time: 9.39 secs
Throughput: 10.65 Requests/sec
Latency: 0.85 secs/Request
And the code:
The LWP::Parallel::UserAgent benchmark: code/lwp-bench.pl
The Apache::Timeit module does PerlHandler Benchmarking. With
the help of this module you can log the time taken to process the
request, just like you'd use the Benchmark module to benchmark a
regular Perl script. Of course you can extend this module to perform
more advanced processing like putting the results into a database for
later processing. All it takes is adding this configuration
directive inside httpd.conf:
PerlFixupHandler Apache::Timeit
Since scripts running under Apache::Registry are running inside the
PerlHandler these are benchmarked as well.
An example of the lines which show up in the error_log file:
timing request for /perl/setupenvoff.pl:
0 wallclock secs ( 0.04 usr + 0.01 sys = 0.05 CPU)
timing request for /perl/setupenvoff.pl:
0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)
The Apache::Timeit package is a part of the Apache-Perl-contrib
files collection available from CPAN.
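Once a number of such entries have accumulated in error_log, they can be summarized with a short script. The following is a sketch that assumes the two-line format shown above (a "timing request for" line followed by a timing line); average_cpu is a made-up helper name, and real log lines may need a stricter pattern.

```perl
use strict;
use warnings;

# Average the CPU seconds reported per URI. The input is a list of
# log lines in the two-line format shown above.
sub average_cpu {
    my @lines = @_;
    my (%total, %count, $uri);
    for my $line (@lines) {
        if ($line =~ /timing request for (\S+):/) {
            $uri = $1;                      # remember which URI is timed
        }
        elsif (defined $uri and $line =~ /=\s*([\d.]+)\s+CPU\)/) {
            $total{$uri} += $1;             # total CPU seconds
            $count{$uri}++;
            undef $uri;
        }
    }
    return map { $_ => $total{$_} / $count{$_} } keys %total;
}

my %avg = average_cpu(
    'timing request for /perl/setupenvoff.pl:',
    ' 0 wallclock secs ( 0.04 usr + 0.01 sys = 0.05 CPU)',
    'timing request for /perl/setupenvoff.pl:',
    ' 0 wallclock secs ( 0.03 usr + 0.00 sys = 0.03 CPU)',
);
printf "%s: %.3f CPU secs on average\n", $_, $avg{$_} for sort keys %avg;
```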
Other tools you may want to take a look at:
HTTP::WebTest
HTTP::WebTest module runs tests on remote URLs or local web files
containing Perl/JSP/HTML/JavaScript/etc. and generates a detailed test
report.
It's available from CPAN.
HTTP::Monkeywrench
HTTP::Monkeywrench is a test-harness application to test the
integrity of a user's path through a web site.
It's available from CPAN.
Apache::Recorder and HTTP::RecordedSession
Apache::Recorder is a mod_perl handler that records an HTTP session
and stores it on the web server's file system.
HTTP::RecordedSession reads the recorded session from the file
system, and formats it for playback using HTTP::WebTest or
HTTP::Monkeywrench. This is useful when writing acceptance and
regression tests.
It's available from CPAN.
The profiling process helps you to determine which subroutines or just snippets of code take the longest time to execute and which subroutines are called most often. Probably you will want to optimize those.
When do you need to profile your code? You do that when you suspect that some part of your code is called very often and that optimizing it may significantly improve the overall performance.
For example, you might have used the diagnostics pragma, which
extends the terse diagnostics normally emitted by both the Perl
compiler and the Perl interpreter, augmenting them with the more
verbose and endearing descriptions found in the perldiag manpage.
You may know that it might slow your code down tremendously, so let's first
prove that this is correct.
We will run a benchmark, once with diagnostics enabled and once disabled, on a subroutine called test_code.
The code inside the subroutine performs a numeric
comparison of two strings. It assigns one string to another if the
condition tests true, but the condition always tests false. To
demonstrate the diagnostics overhead the comparison operator is
intentionally wrong. It should be a string comparison, not a
numeric one.
use Benchmark;
use diagnostics;
use strict;
my $count = 50000;
disable diagnostics;
my $t1 = timeit($count,\&test_code);
enable diagnostics;
my $t2 = timeit($count,\&test_code);
print "Off: ",timestr($t1),"\n";
print "On : ",timestr($t2),"\n";
sub test_code{
  my ($a,$b) = qw(foo bar);
  my $c;
  if ($a == $b) {
    $c = $a;
  }
}
For only a few lines of code we get:
Off: 1 wallclock secs ( 0.81 usr + 0.00 sys = 0.81 CPU)
On : 13 wallclock secs (12.54 usr + 0.01 sys = 12.55 CPU)
With diagnostics enabled, the subroutine test_code() is about 15 times
slower than with diagnostics disabled!
Now let's fix the comparison the way it should be, by replacing ==
with eq, so we get:
my ($a,$b) = qw(foo bar);
my $c;
if ($a eq $b) {
$c = $a;
}
and run the same benchmark again:
Off: 1 wallclock secs ( 0.57 usr + 0.00 sys = 0.57 CPU)
On : 1 wallclock secs ( 0.56 usr + 0.00 sys = 0.56 CPU)
Now there is no overhead at all. The diagnostics pragma slows
things down only when warnings are generated.
After we have verified that using the diagnostics pragma might add
a big overhead to execution runtime, let's use code profiling to
understand why this happens. We are going to use Devel::DProf to
profile the code. Let's use this code:
diagnostics.pl
--------------
use diagnostics;
print "Content-type:text/html\n\n";
test_code();
sub test_code{
  my ($a,$b) = qw(foo bar);
  my $c;
  if ($a == $b) {
    $c = $a;
  }
}
Run it with the profiler enabled, and then create the profiling statistics with the help of dprofpp:
% perl -d:DProf diagnostics.pl
% dprofpp
Total Elapsed Time = 0.342236 Seconds
User+System Time = 0.335420 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
92.1 0.309 0.358 1 0.3089 0.3578 main::BEGIN
14.9 0.050 0.039 3161 0.0000 0.0000 diagnostics::unescape
2.98 0.010 0.010 2 0.0050 0.0050 diagnostics::BEGIN
0.00 0.000 -0.000 2 0.0000 - Exporter::import
0.00 0.000 -0.000 2 0.0000 - Exporter::export
0.00 0.000 -0.000 1 0.0000 - Config::BEGIN
0.00 0.000 -0.000 1 0.0000 - Config::TIEHASH
0.00 0.000 -0.000 2 0.0000 - Config::FETCH
0.00 0.000 -0.000 1 0.0000 - diagnostics::import
0.00 0.000 -0.000 1 0.0000 - main::test_code
0.00 0.000 -0.000 2 0.0000 - diagnostics::warn_trap
0.00 0.000 -0.000 2 0.0000 - diagnostics::splainthis
0.00 0.000 -0.000 2 0.0000 - diagnostics::transmo
0.00 0.000 -0.000 2 0.0000 - diagnostics::shorten
0.00 0.000 -0.000 2 0.0000 - diagnostics::autodescribe
It's not easy to see what is responsible for this enormous overhead,
even if main::BEGIN seems to be running most of the time. To get
the full picture we must see the call tree, which shows us who calls
whom, so we run:
% dprofpp -T
and the output is:
main::BEGIN
diagnostics::BEGIN
Exporter::import
Exporter::export
diagnostics::BEGIN
Config::BEGIN
Config::TIEHASH
Exporter::import
Exporter::export
Config::FETCH
Config::FETCH
diagnostics::unescape
.....................
3159 times [diagnostics::unescape] snipped
.....................
diagnostics::unescape
diagnostics::import
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
main::test_code
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
diagnostics::warn_trap
diagnostics::splainthis
diagnostics::transmo
diagnostics::shorten
diagnostics::autodescribe
So we see that two executions of diagnostics::BEGIN and 3161 of
diagnostics::unescape are responsible for most of the running
overhead.
If we comment out the diagnostics module, we get:
Total Elapsed Time = 0.079974 Seconds
User+System Time = 0.059974 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
0.00 0.000 -0.000 1 0.0000 - main::test_code
It is possible to profile code running under mod_perl with the
Devel::DProf module, available on CPAN. However, you must have
apache version 1.3b3 or higher and the PerlChildExitHandler enabled
during the httpd build process. When the server is started,
Devel::DProf installs an END block to write the tmon.out
file. This block will be called at server shutdown. Here is how to
start and stop a server with the profiler enabled:
% setenv PERL5OPT -d:DProf
% httpd -X -d `pwd` &
... make some requests to the server here ...
% kill `cat logs/httpd.pid`
% unsetenv PERL5OPT
% dprofpp
The Devel::DProf package is a Perl code profiler. It will collect
information on the execution time of a Perl script and of the subs in
that script (remember that print() and map() are just like any
other subroutines you write, but they come bundled with Perl!)
Another approach is to use Apache::DProf, which hooks
Devel::DProf into mod_perl. The Apache::DProf module will run a
Devel::DProf profiler inside each child server and write the
tmon.out file in the directory $ServerRoot/logs/dprof/$$ when
the child is shut down (where $$ is the process ID of the child).
All it takes is to add to httpd.conf:
PerlModule Apache::DProf
Remember that any PerlHandler that was pulled in before
Apache::DProf in the httpd.conf or startup.pl, will not have
its code debugging information inserted. To run dprofpp, chdir to
$ServerRoot/logs/dprof/$$ and run:
% dprofpp
(Lookup the ServerRoot directive's value in httpd.conf to figure
out what's your $ServerRoot.)
A very important aspect of performance tuning is to make sure that your applications don't use much memory, since if they do you cannot run many servers, and therefore in most cases the overall performance degrades under a heavy load.
In addition the code may not be clean and may leak memory, which is even worse: if the same process serves many requests and uses more memory after each one, after a while all the RAM will be used and the machine will start swapping (using the swap partition). This is very undesirable, since it may lead to a machine crash.
The simplest way to figure out how big the processes are and see whether they grow is to watch the output of the top(1) or ps(1) utilities.
For example the output of top(1):
8:51am up 66 days, 1:44, 1 user, load average: 1.09, 2.27, 2.61
95 processes: 92 sleeping, 3 running, 0 zombie, 0 stopped
CPU states: 54.0% user, 9.4% system, 1.7% nice, 34.7% idle
Mem: 387664K av, 309692K used, 77972K free, 111092K shrd, 70944K buff
Swap: 128484K av, 11176K used, 117308K free 170824K cached
PID USER PRI NI SIZE RSS SHARE STAT LIB %CPU %MEM TIME COMMAND
29225 nobody 0 0 9760 9760 7132 S 0 12.5 2.5 0:00 httpd_perl
29220 nobody 0 0 9540 9540 7136 S 0 9.0 2.4 0:00 httpd_perl
29215 nobody 1 0 9672 9672 6884 S 0 4.6 2.4 0:01 httpd_perl
29255 root 7 0 1036 1036 824 R 0 3.2 0.2 0:01 top
376 squid 0 0 15920 14M 556 S 0 1.1 3.8 209:12 squid
29227 mysql 5 5 1892 1892 956 S N 0 1.1 0.4 0:00 mysqld
29223 mysql 5 5 1892 1892 956 S N 0 0.9 0.4 0:00 mysqld
29234 mysql 5 5 1892 1892 956 S N 0 0.9 0.4 0:00 mysqld
The output starts with overall information about the system and then displays
the most active processes at the given moment. So for example if we
look at the httpd_perl processes we can see the size of the
resident (RSS) and shared (SHARE) memory segments. This sample
was taken on a production server running Linux.
But of course we want to see all the apache/mod_perl processes, and that's where ps(1) comes to help. The options of this utility vary from one Unix flavor to another, and some flavors provide their own tools. Let's check the information about mod_perl processes:
% ps -o pid,user,rss,vsize,%cpu,%mem,ucomm -C httpd_perl
PID USER RSS VSZ %CPU %MEM COMMAND
29213 root 8584 10264 0.0 2.2 httpd_perl
29215 nobody 9740 11316 1.0 2.5 httpd_perl
29216 nobody 9668 11252 0.7 2.4 httpd_perl
29217 nobody 9824 11408 0.6 2.5 httpd_perl
29218 nobody 9712 11292 0.6 2.5 httpd_perl
29219 nobody 8860 10528 0.0 2.2 httpd_perl
29220 nobody 9616 11200 0.5 2.4 httpd_perl
29221 nobody 8860 10528 0.0 2.2 httpd_perl
29222 nobody 8860 10528 0.0 2.2 httpd_perl
29224 nobody 8860 10528 0.0 2.2 httpd_perl
29225 nobody 9760 11340 0.7 2.5 httpd_perl
29235 nobody 9524 11104 0.4 2.4 httpd_perl
Now you can see the resident (RSS) and virtual (VSZ) memory
segments (and shared memory segment if you ask for it) of all mod_perl
processes. Please refer to the top(1) and ps(1) man pages for more
information.
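When there are many processes it is handy to total a column of the ps(1) output. The sketch below sums the RSS column of output in the format shown above; total_rss is a hypothetical helper, and the rows are a few lines copied from the sample. Keep in mind that this sum counts shared pages once per process, so it overstates the real memory usage.

```perl
use strict;
use warnings;

# Sum the RSS column (in KB) of ps output whose first line names the columns.
sub total_rss {
    my @lines  = @_;
    my @header = split ' ', shift @lines;
    my ($rss_col) = grep { $header[$_] eq 'RSS' } 0 .. $#header;
    die "no RSS column found\n" unless defined $rss_col;
    my $total = 0;
    for my $line (@lines) {
        my @fields = split ' ', $line;
        $total += $fields[$rss_col] if @fields > $rss_col;
    }
    return $total;
}

# A few rows copied from the sample above; in practice the lines
# could come from running the ps command shown earlier.
my @sample = (
    'PID USER RSS VSZ %CPU %MEM COMMAND',
    '29213 root 8584 10264 0.0 2.2 httpd_perl',
    '29215 nobody 9740 11316 1.0 2.5 httpd_perl',
    '29216 nobody 9668 11252 0.7 2.4 httpd_perl',
);
printf "total RSS: %d KB\n", total_rss(@sample);
```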
You will probably agree that using top(1) and ps(1) is cumbersome if we
want to sample memory sizes during a benchmark test. We want
a way to print memory sizes at desired places during the program's
execution. The GTop module, a Perl glue to the libgtop library,
is exactly what we need, if you have it installed.
Note: GTop requires the libgtop library but is not available for
all platforms. See the docs in the source at
ftp://ftp.gnome.org/pub/GNOME/stable/sources/gtop/ to check whether
your platform/flavor is supported.
GTop provides an API for retrieval of information about processes
and the whole system. We are interested only in memory sampling API
methods. To print all the process related memory information we can
execute the following code:
use GTop;
my $gtop = GTop->new;
my $proc_mem = $gtop->proc_mem($$);
for (qw(size vsize share rss)) {
    printf " %s => %d\n", $_, $proc_mem->$_();
}
When executed we see the following output (in bytes):
size => 1900544
vsize => 3108864
share => 1392640
rss => 1900544
So if we want to print the size of the process's resident memory segment before and after some event, we just do it. For example, if we want to see how much extra memory was allocated after a variable creation we can write the following code:
use GTop;
my $gtop = GTop->new;
my $before = $gtop->proc_mem($$)->rss;
my $x = 'a' x 10000;
my $after = $gtop->proc_mem($$)->rss;
print "diff: ",$after-$before, " bytes\n";
and the output is:
diff: 20480 bytes
So we can see that Perl has allocated an extra 20480 bytes to create
$x (of course the creation of $after needed a few bytes as well,
but that is insignificant compared to the size of $x).
The Apache::VMonitor module with help of the GTop module allows
you to watch all your system information using your favorite browser
from anywhere in the world without a need to telnet to your machine.
If you are looking at what information you can retrieve with GTop,
you should look at Apache::VMonitor as it deploys a big part of
the API GTop provides.
If you are running a true BSD system, you may use
BSD::Resource::getrusage instead of GTop. For example:
print "used memory = ".(BSD::Resource::getrusage)[2]."\n"
For more information refer to the BSD::Resource manpage.
With help of Apache::Status you can find out the size of each
and every subroutine.
Build and install mod_perl as you always do, make sure it's version 1.22 or higher.
Configure /perl-status if you haven't already:
<Location /perl-status>
SetHandler perl-script
PerlHandler Apache::Status
order deny,allow
#deny from all
#allow from ...
</Location>
Add to httpd.conf
PerlSetVar StatusOptionsAll On
PerlSetVar StatusTerse On
PerlSetVar StatusTerseSize On
PerlSetVar StatusTerseSizeMainSummary On
PerlModule B::TerseSize
Start the server (best in httpd -X mode)
From your favorite browser fetch http://localhost/perl-status
Click on 'Loaded Modules' or 'Compiled Registry Scripts'
Click on the module or script of your choice (you might need to run some script/handler before you will see it here unless it was preloaded)
Click on 'Memory Usage' at the bottom
You should see all the subroutines and their respective sizes.
Now you can start to optimize your code. Or test which of the several implementations is of the least size.
For example let's compare CGI.pm's OO vs. procedural interfaces:
As you will see below the first OO script uses about 2k bytes while the second script (procedural interface) uses about 5k.
Here are the code examples and the numbers:
cgi_oo.pl
---------
use CGI ();
my $q = CGI->new;
print $q->header;
print $q->b("Hello");
cgi_mtd.pl
---------
use CGI qw(header b);
print header();
print b("Hello");
After executing each script in single server mode (-X) the results are:
Totals: 1966 bytes | 27 OPs
handler 1514 bytes | 27 OPs
exit 116 bytes | 0 OPs

Totals: 4710 bytes | 19 OPs
handler 1117 bytes | 19 OPs
basefont 120 bytes | 0 OPs
frameset 120 bytes | 0 OPs
caption 119 bytes | 0 OPs
applet 118 bytes | 0 OPs
script 118 bytes | 0 OPs
ilayer 118 bytes | 0 OPs
header 118 bytes | 0 OPs
strike 118 bytes | 0 OPs
layer 117 bytes | 0 OPs
table 117 bytes | 0 OPs
frame 117 bytes | 0 OPs
style 117 bytes | 0 OPs
Param 117 bytes | 0 OPs
small 117 bytes | 0 OPs
embed 117 bytes | 0 OPs
font 116 bytes | 0 OPs
span 116 bytes | 0 OPs
exit 116 bytes | 0 OPs
big 115 bytes | 0 OPs
div 115 bytes | 0 OPs
sup 115 bytes | 0 OPs
Sub 115 bytes | 0 OPs
TR 114 bytes | 0 OPs
td 114 bytes | 0 OPs
Tr 114 bytes | 0 OPs
th 114 bytes | 0 OPs
b 113 bytes | 0 OPs
Note that the above is correct only if you didn't precompile all of
CGI.pm's methods at server startup. If you did, the
procedural interface in the second test will take up to 18k and not 5k
as we saw. That's because the whole of CGI.pm's namespace is
inherited and it already has all its methods compiled, so it doesn't
really matter whether you attempt to import only the symbols that you
need. So if you have:
use CGI qw(-compile :all);
in the server startup script, then having:
use CGI qw(header);
or
use CGI qw(:all);
is essentially the same. All the symbols precompiled at
startup will be imported even if you ask for only one symbol. It
seems to me like a bug, but probably that's how CGI.pm works.
BTW, you can check the number of opcodes in the code with a simple command line run, for example comparing 'my %hash' vs. 'my %hash = ()':
% perl -MO=Terse -e 'my %hash' | wc -l
-e syntax OK
4
% perl -MO=Terse -e 'my %hash = ()' | wc -l
-e syntax OK
10
The first one has fewer opcodes.
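The same comparison can be scripted. The following is only a minimal sketch: the helper name terse_line_count() is made up for illustration, and it assumes a Unix-like system where the list form of a piped open works. It runs the current perl with -MO=Terse and counts the lines of output, just as wc -l does above. The "-e syntax OK" message still goes to STDERR, and the exact counts depend on your Perl version.

```perl
use strict;
use warnings;

# Hypothetical helper: run "perl -MO=Terse -e CODE" and count the lines
# of output (B::Terse prints each op on its own line).
sub terse_line_count {
    my ($code) = @_;
    # $^X is the path of the perl binary that runs this script.
    open my $pipe, '-|', $^X, '-MO=Terse', '-e', $code
        or die "Cannot run $^X: $!";
    my @lines = <$pipe>;
    close $pipe;
    return scalar @lines;
}

for my $code ('my %hash', 'my %hash = ()') {
    printf "%-15s %d lines\n", $code, terse_line_count($code);
}
```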
Note that you shouldn't use the Apache::Status module on a production
server as it adds quite a bit of overhead to each request.
In order to get the best performance it helps to get intimately familiar with the Operating System (OS) the web server is running on. There are many OS specific things that you may be able to optimize which will improve your web server's speed, reliability and security.
The following sections will reveal some of the most important details you should know about your OS.
The sharing of memory is one very important factor. If your OS supports it (and most sane systems do), you might save memory by sharing it between child processes. This is only possible when you preload code at server startup. However, during a child process' life its memory pages tend to become unshared.
There is no way we can make Perl allocate memory so that (dynamic) variables land on different memory pages from constants, so the copy-on-write effect (we will explain this in a moment) will hit you almost at random.
If you are pre-loading many modules you might be able to trade off the
memory that stays shared against the time for an occasional fork by
tuning MaxRequestsPerChild. Each time a child reaches this upper
limit and dies it should release its unshared pages. The new child
which replaces it will share its fresh pages until it scribbles on
them.
The ideal is a point where your processes usually restart before too
much memory becomes unshared. You should take some measurements to
see if it makes a real difference, and to find the range of reasonable
values. If you have success with this tuning the value of
MaxRequestsPerChild will probably be peculiar to your situation and
may change with changing circumstances.
It is very important to understand that your goal is not to push
MaxRequestsPerChild up to 10000. Having a child serve 300
requests on precompiled code is already a huge overall speedup, so
whether it is 100 or 10000 probably does not really matter if you can
save RAM by using a lower value.
Do not forget that if you preload most of your code at server startup, the newly forked child gets ready very fast, because it inherits most of the preloaded code and the perl interpreter from the parent process.
During the life of the child its memory pages (which aren't really its own to start with, it uses the parent's pages) gradually get 'dirty': variables which were originally inherited and shared are updated or modified, and the copy-on-write happens. This reduces the number of shared memory pages, thus increasing the memory requirement. Killing the child and spawning a new one allows the new child to get back to the pristine shared memory of the parent process.
The recommendation is that MaxRequestsPerChild should not be too
large, otherwise you lose some of the benefit of sharing memory.
See Choosing MaxRequestsPerChild for more
about tuning the MaxRequestsPerChild parameter.
You've probably noticed that the word shared is repeated many times in relation to mod_perl. Indeed, shared memory might save you a lot of money, since with sharing in place you can run many more servers than without it. See the Formula and the numbers.
How much shared memory do you have? You can see it by either using
the memory utility that comes with your system or you can deploy the
GTop module:
use GTop ();
print "Shared memory of the current process: ",
GTop->new->proc_mem($$)->share,"\n";
print "Total shared memory: ",
GTop->new->mem->share,"\n";
When you watch the output of the top utility, don't confuse the
RES (or RSS) columns with the SHARE column. RES is
RESident memory, which is the size of pages currently swapped in.
I have shown how to measure the size of the process' shared memory, but we still want to know what the real memory usage is. Obviously this cannot be calculated simply by adding up the memory size of each process because that wouldn't account for the shared memory.
On the other hand we cannot just subtract the shared memory size from the total size to get the real memory usage numbers, because in reality each process has a different history of processed requests, therefore the shared memory is not the same for all processes.
So how do we measure the real memory size used by the server we run? It's probably too difficult to give the exact number, but I've found a way to get a fair approximation, which was verified in the following way. I calculated the real memory used, by the technique you will see in a moment, then stopped the Apache server and saw that the memory usage report indicated that the total used memory went down by almost the same number I had calculated. Note that some OSs do smart memory page caching, so you may not see the memory usage decrease as soon as it actually happens when you quit the application.
This is a technique I've used:
For each process, sum up the difference between the process size and its shared memory. To calculate the difference for a single process use:
use GTop ();
my $proc_mem = GTop->new->proc_mem($$);
my $diff = $proc_mem->size - $proc_mem->share;
print "Difference is $diff bytes\n";
Now if we add the shared memory size of the process with maximum shared memory, we will get all the memory that actually is being used by all httpd processes, except for the parent process.
Finally, add the size of the parent process.
Please note that this might be incorrect for your system, so use this number at your own risk.
I've used this technique to display real memory usage in the module Apache::VMonitor, so instead of trying to manually calculate this number you can use this module to do it automatically. In fact in the calculations used in this module there is no separation between the parent and child processes, they are all counted in the same way using the following code:
use GTop ();
my $gtop = GTop->new;
my $total_real = 0;
my $max_shared = 0;
# @mod_perl_pids is initialized by Apache::Scoreboard, irrelevant here
my @mod_perl_pids = some_code();
for my $pid (@mod_perl_pids) {
  my $proc_mem = $gtop->proc_mem($pid);
  my $size = $proc_mem->size;
  my $share = $proc_mem->share;
  # accumulate the unshared part of every process
  $total_real += $size - $share;
  # remember the largest shared size seen so far
  $max_shared = $share if $max_shared < $share;
}
# count the shared pages only once
$total_real += $max_shared;
So as you see, we accumulate the difference between the reported size and the shared memory:
$total_real += $size-$share;
and at the end add the biggest shared process size:
$total_real += $max_shared;
So now $total_real contains approximately the memory that is really used.
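To try the accounting without GTop or a running server, here is a minimal sketch of the same loop as a plain function. The name approx_real_memory() and the sample numbers are made up for illustration only. On a real system the size and share values would come from GTop as shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: each argument is [size, share] in bytes for one
# httpd process. Returns the approximate memory really used.
sub approx_real_memory {
    my @procs = @_;
    my ($total_real, $max_shared) = (0, 0);
    for my $proc (@procs) {
        my ($size, $share) = @$proc;
        $total_real += $size - $share;                  # unshared part
        $max_shared  = $share if $max_shared < $share;  # largest shared size
    }
    return $total_real + $max_shared;                   # shared counted once
}

# Made-up sample values, not measurements.
my @sample = (
    [ 4_710_400, 3_997_696 ],
    [ 4_800_000, 3_900_000 ],
    [ 5_000_000, 3_500_000 ],
);
printf "Approximate real memory: %d bytes\n", approx_real_memory(@sample);
```

Summing the three sizes would give 14510400 bytes, while this estimate for the same sample is 7110400 bytes, which shows how much the shared pages matter.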
How do you find out if the code you write is shared between the processes or not? The code should be shared, except where it is on a memory page with variables that change. Some variables are read-only in usage and never change. For example, you may have some variables that use a lot of memory and that you want to stay read-only. As you know, a variable becomes unshared when the process modifies its value.
So imagine that you have a 10Mb in-memory database that resides in a single variable. You perform various operations on it and want to make sure that the variable is still shared. For example, if you do some regular expression (regex) matching on this variable and want to use the pos() function, will it make the variable unshared or not?
The Apache::Peek module comes to rescue. Let's write a module
called MyShared.pm which we preload at server startup, so all the
variables of this module are initially shared by all children.
MyShared.pm
---------
package MyShared;
use Apache::Peek;
my $readonly = "Chris";
sub match { $readonly =~ /\w/g; }
sub print_pos{ print "pos: ",pos($readonly),"\n";}
sub dump { Dump($readonly); }
1;
This module declares the package MyShared, loads the
Apache::Peek module and defines the lexically scoped $readonly
variable which is supposed to be a variable of large size (think about
a huge hash data structure), but we will use a small one to simplify
this example.
The module also defines three subroutines: match() that does a simple
character matching, print_pos() that prints the current position of
the matching engine inside the string that was last matched and
finally the dump() subroutine that calls the Apache::Peek module's
Dump() function to dump a raw Perl data-type of the $readonly
variable.
Now we write the script that prints the process ID (PID) and calls all three functions. The goal is to check whether pos() makes the variable dirty and therefore unshared.
share_test.pl
-------------
use MyShared;
print "Content-type: text/plain\r\n\r\n";
print "PID: $$\n";
MyShared::match();
MyShared::print_pos();
MyShared::dump();
Before you restart the server, in httpd.conf set:
MaxClients 2
for easier tracking. You need at least two servers to compare the print outs of the test program. Having more than two can make the comparison process harder.
Now open two browser windows and issue the request for this script several times in both windows, so you get different processes PIDs reported in the two windows and each process has processed a different number of requests to the share_test.pl script.
In the first window you will see something like this:
PID: 27040
pos: 1
SV = PVMG(0x853db20) at 0x8250e8c
REFCNT = 3
FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
IV = 0
NV = 0
PV = 0x8271af0 "Chris"\0
CUR = 5
LEN = 6
MAGIC = 0x853dd80
MG_VIRTUAL = &vtbl_mglob
MG_TYPE = 'g'
MG_LEN = 1
And in the second window:
PID: 27041
pos: 2
SV = PVMG(0x853db20) at 0x8250e8c
REFCNT = 3
FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
IV = 0
NV = 0
PV = 0x8271af0 "Chris"\0
CUR = 5
LEN = 6
MAGIC = 0x853dd80
MG_VIRTUAL = &vtbl_mglob
MG_TYPE = 'g'
MG_LEN = 2
We see that all the addresses of the supposedly big structure are the
same (0x8250e8c and 0x8271af0), therefore the variable data
structure is almost completely shared. The only difference is in the
SV.MAGIC.MG_LEN record, which is not shared.
So given that the $readonly variable is a big one, its value is
still shared between the processes, while part of the variable data
structure is non-shared. But this part is almost insignificant because it
takes very little memory space.
Now if you need to compare more than one variable, doing it by hand can be
quite time consuming and error prone. Therefore it's better to
change the testing script to dump the Perl data-types into files (e.g.
/tmp/dump.$$, where $$ is the PID of the process) and then use the
diff(1) utility to see whether there is some difference.
So changing the dump() function to write the info to a file will
do the job. Notice that we use Devel::Peek and not
Apache::Peek. Both are almost the same, but Apache::Peek
prints its output directly to the opened socket so we cannot intercept
and redirect the result to a file. Since Devel::Peek dumps
results to the STDERR stream we can use the old trick of saving away
the default STDERR filehandle, and opening a new filehandle using
STDERR. In our example when Devel::Peek now prints to STDERR it
actually prints to our file. When we are done, we make sure to restore
the original STDERR filehandle.
So this is the resulting code:
MyShared2.pm
---------
package MyShared2;
use Devel::Peek;
my $readonly = "Chris";
sub match { $readonly =~ /\w/g; }
sub print_pos{ print "pos: ",pos($readonly),"\n";}
sub dump{
my $dump_file = "/tmp/dump.$$";
print "Dumping the data into $dump_file\n";
open OLDERR, ">&STDERR";
open STDERR, ">".$dump_file or die "Can't open $dump_file: $!";
Dump($readonly);
close STDERR ;
open STDERR, ">&OLDERR";
}
1;
Now we modify the script to use the new module:
share_test2.pl
--------------
use MyShared2;
print "Content-type: text/plain\r\n\r\n";
print "PID: $$\n";
MyShared2::match();
MyShared2::print_pos();
MyShared2::dump();
And run it as before (with MaxClients 2), two dump files will be created in the directory /tmp. In our test these were created as /tmp/dump.1224 and /tmp/dump.1225. When we run diff(1):
% diff /tmp/dump.1224 /tmp/dump.1225
12c12
< MG_LEN = 1
---
> MG_LEN = 2
We see that the two padlists (of the variable readonly) are
different, as we have observed before when we did a manual comparison.
In fact, if we think about these results again, we come to the
conclusion that there is no need for two processes to find out whether
the variable gets modified (and therefore unshared). It's enough to
check the data structure before the script was executed and again after that.
You can modify the MyShared2 module to dump the padlists into a
different file after each invocation and then run diff(1) on
the two files.
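The before-and-after check inside a single process can be sketched as follows. This is only an illustration, not the guide's own code: the helper name dump_to_string() is hypothetical, and it uses a temporary file from File::Temp instead of /tmp/dump.$$. It reuses the STDERR redirection trick described above to capture the Devel::Peek output, and compares the two captures as strings instead of running diff(1).

```perl
use strict;
use warnings;
use Devel::Peek ();
use File::Temp ();

# Hypothetical helper: capture what Devel::Peek::Dump() writes to STDERR
# for the scalar behind $ref and return it as one string.
sub dump_to_string {
    my ($ref) = @_;
    my $tmp = File::Temp->new;    # temporary file, removed automatically
    open my $olderr, '>&', \*STDERR or die "Can't dup STDERR: $!";
    open STDERR, '>', $tmp->filename or die "Can't redirect STDERR: $!";
    Devel::Peek::Dump($$ref);
    close STDERR;
    open STDERR, '>&', $olderr or die "Can't restore STDERR: $!";
    open my $in, '<', $tmp->filename or die "Can't read dump: $!";
    local $/;                     # slurp the whole file at once
    return scalar <$in>;
}

my $readonly = "Chris";
my $before   = dump_to_string(\$readonly);
my $matched  = $readonly =~ /\w/g;    # scalar //g match sets pos()
my $after    = dump_to_string(\$readonly);

print $before eq $after
    ? "The variable was not modified\n"
    : "The variable was modified\n";
```

Comparing the two strings only tells you that something changed. To see which record changed, write both captures to files and run diff(1) on them as described above.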
If you want to watch whether some lexically scoped (with my())
variables in your Apache::Registry script inside the same process
get changed between invocations you can use the
Apache::RegistryLexInfo module instead. It does exactly
this: it makes a snapshot of the padlist before and after the code
execution and shows the difference between the two. This specific
module was written to work with Apache::Registry scripts so it
won't work for loaded modules. Use the technique we have described
above for any type of variables in modules and scripts.
Another way of ensuring that a scalar is read-only and therefore
sharable is to use either the constant pragma or the readonly
pragma. But then you won't be able to make calls that alter the
variable even a little, like in the example that we just showed,
because it will be a true constant and you will get a compile
time error if you try this:
MyConstant.pm
-------------
package MyConstant;
use constant readonly => "Chris";
sub match { readonly =~ /\w/g; }
sub print_pos{ print "pos: ",pos(readonly),"\n";}
1;
% perl -c MyConstant.pm
Can't modify constant item in match position at MyConstant.pm line 5, near "readonly)"
MyConstant.pm had compilation errors.
However, this code compiles without errors:
MyConstant1.pm
-------------
package MyConstant1;
use constant readonly => "Chris";
sub match { readonly =~ /\w/g; }
1;
You can use the PerlRequire and PerlModule directives to load
commonly used modules such as CGI.pm, DBI, etc., when the
server is started. On most systems, server children will be able to
share the code space used by these modules. Just add the following
directives into httpd.conf:
PerlModule CGI
PerlModule DBI
But an even better approach is to create a separate startup file (where you code in plain Perl) and put there things like:
use DBI ();
use Carp ();
Don't forget to prevent importing of the symbols exported by default
by the module you are going to preload, by placing empty parentheses
() after a module's name, unless you need some of these in the
startup file, which is unlikely. This will save you a little more
memory.
Then you require() this startup file in httpd.conf with the
PerlRequire directive, placing it before the rest of the mod_perl
configuration directives:
PerlRequire /path/to/start-up.pl
CGI.pm is a special case. Ordinarily CGI.pm autoloads most of
its functions on an as-needed basis. This speeds up the loading time
by deferring the compilation phase. When you use mod_perl, FastCGI or
another system that uses a persistent Perl interpreter, you will want
to precompile the functions at initialization time. To accomplish
this, call the package function compile() like this:
use CGI ();
CGI->compile(':all');
The arguments to compile() are a list of method names or sets, and
are identical to those accepted by the use() and import()
operators. Note that in most cases you will want to replace ':all'
with the tag names that you actually use in your code, since generally
you only use a subset of them.
Let's conduct a memory usage test to prove that preloading reduces memory requirements.
In order to have an easy measurement we will use only one child process, therefore we will use this setting:
MinSpareServers 1
MaxSpareServers 1
StartServers 1
MaxClients 1
MaxRequestsPerChild 100
We are going to use the Apache::Registry script memuse.pl which
consists of two parts: the first one loads a bunch of modules
(most of which aren't going to be used), the second part reports the
memory size and the shared memory size used by the single child
process that we start, and of course it prints the difference between
the two sizes.
memuse.pl
---------
use strict;
use CGI ();
use DB_File ();
use LWP::UserAgent ();
use Storable ();
use DBI ();
use GTop ();
my $r = shift;
$r->send_http_header('text/plain');
my $proc_mem = GTop->new->proc_mem($$);
my $size = $proc_mem->size;
my $share = $proc_mem->share;
my $diff = $size - $share;
printf "%10s %10s %10s\n", qw(Size Shared Difference);
printf "%10d %10d %10d (bytes)\n",$size,$share,$diff;
First we restart the server and execute this CGI script when none of the above modules is preloaded. Here is the result:
   Size   Shared     Diff
4706304  2134016  2572288 (bytes)
Now we take all the modules:
use strict;
use CGI ();
use DB_File ();
use LWP::UserAgent ();
use Storable ();
use DBI ();
use GTop ();
and copy them into the startup script, so they will get preloaded. The script remains unchanged. We restart the server and execute it again. We get the following:
   Size   Shared     Diff
4710400  3997696   712704 (bytes)
Let's put the two results into one table:
Preloading Size Shared Diff
Yes 4710400 3997696 712704 (bytes)
No 4706304 2134016 2572288 (bytes)
--------------------------------------------
Difference 4096 1863680 -1859584
You can clearly see that when the modules weren't preloaded the shared memory was about 1.8Mb (1863680 bytes) smaller than in the case where the modules were preloaded.
Assuming that you have 256M dedicated to the web server, if you didn't preload the modules, you could have:
268435456 = X * 2572288 + 2134016
X = (268435456 - 2134016) / 2572288 = 103
103 servers.
Now let's calculate the same thing with modules preloaded:
268435456 = X * 712704 + 3997696
X = (268435456 - 3997696) / 712704 = 371
You can have 371 servers, about 3.6 times as many!
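The arithmetic above is easy to repeat for your own measurements. The following is a minimal sketch. The function name max_servers() is made up for illustration, and the formula is the one used above: the shared pages are counted once and every child adds its unshared part.

```perl
use strict;
use warnings;

# Hypothetical helper: how many child processes fit into $ram bytes when
# every child needs $diff unshared bytes and the shared pages ($shared
# bytes) are counted only once.
sub max_servers {
    my ($ram, $shared, $diff) = @_;
    return int( ($ram - $shared) / $diff );
}

my $ram = 256 * 1024 * 1024;    # 268435456 bytes

printf "not preloaded: %d servers\n", max_servers($ram, 2134016, 2572288);
printf "preloaded:     %d servers\n", max_servers($ram, 3997696, 712704);
```

With the Shared and Diff values measured in this section it reports 103 and 371 servers, the same numbers as in the calculation above.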
Remember that we have mentioned before that memory pages get dirty and the size of the shared memory gets smaller with time? Here we have presented the ideal case where the shared memory stays intact. Therefore the real numbers will be a little bit different, but not far from the numbers in our example.
Also, in your case the process size will probably be bigger and the shared memory smaller, since you will use different modules and different code, so you won't get this fantastic ratio, but this example certainly helps to feel the difference.
What happens if you find yourself stuck with Perl CGI scripts and you
cannot or don't want to move most of the stuff into modules to benefit
from module preloading, so that the code will be shared by the children?
Luckily you can preload scripts as well. This time the
Apache::RegistryLoader module comes to your aid.
Apache::RegistryLoader compiles Apache::Registry scripts at
server startup.
For example to preload the script /perl/test.pl which is in fact the file /home/httpd/perl/test.pl you would do the following:
use Apache::RegistryLoader ();
Apache::RegistryLoader->new->handler("/perl/test.pl",
"/home/httpd/perl/test.pl");
You should put this code either into <Perl> sections or
into a startup script.
But what if you have a bunch of scripts located under the same
directory and you don't want to list them one by one? Take
advantage of Perl modules and put them to good use. The File::Find
module will do most of the work for you.
The following code walks the directory tree under which all
Apache::Registry scripts are located. For each encountered file
with extension .pl, it calls the
Apache::RegistryLoader::handler() method to preload the script in
the parent server, before pre-forking the child processes:
use File::Find qw(finddepth);
use Apache::RegistryLoader ();
{
my $scripts_root_dir = "/home/httpd/perl/";
my $rl = Apache::RegistryLoader->new;
finddepth
(
sub {
return unless /\.pl$/;
my $url = "$File::Find::dir/$_";
$url =~ s|$scripts_root_dir/?|/|;
warn "pre-loading $url\n";
# preload $url
my $status = $rl->handler($url);
unless($status == 200) {
warn "pre-load of `$url' failed, status=$status\n";
}
},
$scripts_root_dir);
}
Note that we didn't use the second argument to handler() here, as
in the first example. To make the loader smarter about the URI to
filename translation, you might need to provide a trans() function
to translate the URI to a filename. URI to filename translation
normally doesn't happen until HTTP request time, so the module is
forced to roll its own translation. If the filename is omitted and a
trans() function was not defined, the loader will try using the URI
relative to ServerRoot.
A simple trans() function can be something like this:
sub mytrans {
my $uri = shift;
$uri =~ s|^/perl/|/home/httpd/perl/|;
return $uri;
}
You can easily derive the right translation by looking at the Alias
directive. The above mytrans() function is matching our Alias:
Alias /perl/ /home/httpd/perl/
After defining the URI to filename translation function you should
pass it during the creation of the Apache::RegistryLoader object:
my $rl = Apache::RegistryLoader->new(trans => \&mytrans);
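An alternative to a trans() function is to work out both the URI and the filename while walking the directory, and pass the pair to handler() as in the first example. The sketch below shows only the mapping step. The function name registry_script_map() and its two arguments are hypothetical, and it deliberately does not load Apache::RegistryLoader, so it can be tried outside the server.

```perl
use strict;
use warnings;
use File::Find qw(find);

# Hypothetical helper: map every *.pl file under $root to the URI it is
# served from, given the URI prefix taken from the Alias directive.
# Returns a hash reference of the form { uri => filename }.
sub registry_script_map {
    my ($root, $uri_prefix) = @_;
    $root       =~ s{/+$}{};    # drop trailing slashes
    $uri_prefix =~ s{/+$}{};
    my %map;
    find(
        {
            no_chdir => 1,      # keep full paths in $File::Find::name
            wanted   => sub {
                my $file = $File::Find::name;
                return unless -f $file && $file =~ /\.pl$/;
                # replace the directory root with the URI prefix
                (my $uri = $file) =~ s{^\Q$root\E}{$uri_prefix};
                $map{$uri} = $file;
            },
        },
        $root,
    );
    return \%map;
}

# In a startup file the pairs could then be preloaded like this:
#   my $map = registry_script_map("/home/httpd/perl/", "/perl/");
#   $rl->handler($_, $map->{$_}) for sort keys %$map;
```

The prefix and the directory mirror the Alias directive shown above, so the two stay easy to keep in sync.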
I won't show any benchmarks here, since the effect is absolutely the same as with preloading modules.
See also BEGIN blocks
We have just learned that it's important to preload the modules and scripts at server startup. It turns out that this is not enough for some modules, and you have to prerun their initialization code to get more memory pages shared. Basically you will find information about specific modules in their respective manpages. We will present a few examples of widely used modules where the code can be initialized.
The first example is the DBI module. As you know DBI works with
many database drivers falling into the DBD:: category,
e.g. DBD::mysql. It's not enough to preload DBI, you should
initialize DBI with driver(s) that you are going to use (usually a
single driver is used), if you want to minimize memory use after
forking the child processes. Note that you want to do this under
mod_perl and other environments where the shared memory is very
important. Otherwise you shouldn't initialize drivers.
You probably know already that under mod_perl you should use the
Apache::DBI module to get the connection persistence, unless you
open a separate connection for each user--in this case you should not
use this module. Apache::DBI automatically loads DBI and
overrides some of its methods, so you should continue coding like
there is only a DBI module.
Just as with module preloading, our goal is to find the startup environment that will lead to the smallest "difference" between the shared and normal memory reported, and therefore to a smaller total memory usage.
And again in order to have an easy measurement we will use only one child process, therefore we will use this setting in httpd.conf:
MinSpareServers 1
MaxSpareServers 1
StartServers 1
MaxClients 1
MaxRequestsPerChild 100
We always preload these modules:
use GTop ();
use Apache::DBI (); # preloads DBI as well
We are going to run memory benchmarks on five different versions of the startup.pl file.
1. Leave the file unmodified.
2. Install MySQL driver (we will use MySQL RDBMS for our test):
DBI->install_driver("mysql");
It's safe to use this method, since just like with use(), if it
can't be installed it'll die().
3. Preload MySQL driver module:
use DBD::mysql;
4. Tell Apache::DBI to connect to the database when the child process
starts (ChildInitHandler), no driver is preloaded before the child
gets spawned!
Apache::DBI->connect_on_init('DBI:mysql:test::localhost',
                             "",
                             "",
                             {
                              PrintError => 1, # warn() on errors
                              RaiseError => 0, # don't die on error
                              AutoCommit => 1, # commit executes
                                               # immediately
                             }
                            )
  or die "Cannot connect to database: $DBI::errstr";
5. Combine versions 2 and 4: use both connect_on_init() and install_driver().
Here is the Apache::Registry test script that we have used:
preload_dbi.pl
--------------
use strict;
use GTop ();
use DBI ();
my $dbh = DBI->connect("DBI:mysql:test::localhost",
"",
"",
{
PrintError => 1, # warn() on errors
RaiseError => 0, # don't die on error
AutoCommit => 1, # commit executes
# immediately
}
)
or die "Cannot connect to database: $DBI::errstr";
my $r = shift;
$r->send_http_header('text/plain');
my $do_sql = "show tables";
my $sth = $dbh->prepare($do_sql);
$sth->execute();
my @data = ();
while (my @row = $sth->fetchrow_array){
push @data, @row;
}
print "Data: @data\n";
$dbh->disconnect(); # NOP under Apache::DBI
my $proc_mem = GTop->new->proc_mem($$);
my $size = $proc_mem->size;
my $share = $proc_mem->share;
my $diff = $size - $share;
printf "%8s %8s %8s\n", qw(Size Shared Diff);
printf "%8d %8d %8d (bytes)\n",$size,$share,$diff;
The script opens a connection to the database 'test' and
issues a query to learn what tables the database has. When the data
is collected and printed the connection would be closed in the regular
case, but Apache::DBI overrides disconnect() with an empty method. After the
data is processed, the already familiar code that prints the memory
usage follows.
The server was restarted before each new test.
So here are the results of the five tests that were conducted, sorted by the Diff column:
After the first request:
Test type                               Size   Shared     Diff
--------------------------------------------------------------
install_driver (2)                   3465216  2621440   843776
install_driver & connect_on_init (5) 3461120  2609152   851968
preload driver (3)                   3465216  2605056   860160
nothing added (1)                    3461120  2494464   966656
connect_on_init (4)                  3461120  2482176   978944
After the second request (all the subsequent requests showed the same results):
Test type                               Size   Shared     Diff
--------------------------------------------------------------
install_driver (2)                   3469312  2609152   860160
install_driver & connect_on_init (5) 3481600  2605056   876544
preload driver (3)                   3469312  2588672   880640
nothing added (1)                    3477504  2482176   995328
connect_on_init (4)                  3481600  2469888  1011712
Now what do we conclude from looking at these numbers? First we see that only after the second request do we get the final memory footprint for the specific request in question (if you pass different arguments the memory usage might and will be different).
But both tables show the same pattern of memory usage. We can clearly see that the real winner is the startup.pl file's version where the MySQL driver was installed (2). Since we want to have a connection ready for the first request made to the freshly spawned child process, we generally use the version (5) which uses somewhat more memory, but has almost the same number of shared memory pages. The version (3) only preloads the driver which results in smaller shared memory. The last two versions have nothing initialized (1) or have only the connect_on_init() method used (4). The former is a little bit better than the latter, but both are significantly worse than the first two versions.
To remind you why we look for the smallest value in the Diff column, recall the real memory usage formula:
RAM_dedicated_to_mod_perl = diff * number_of_processes
+ the_processes_with_largest_shared_memory
Notice that the smaller the diff is, the bigger the number of processes you can have using the same amount of RAM. Therefore every 100K difference counts, when you multiply it by the number of processes. If we take the numbers from version (2) vs. (4) and assume that we have 256M of memory dedicated to mod_perl processes, we will get the following numbers using the formula derived from the above formula:
             RAM - largest_shared_size
N_of Procs = -------------------------
                       Diff

             268435456 - 2609152
(ver 2)  N = ------------------- = 309
                   860160

             268435456 - 2469888
(ver 4)  N = ------------------- = 262
                   1011712
So you can tell the difference (about 18% more child processes with version 2).
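To rank all five versions rather than just two, the same formula can be applied to every row of the second table. This is only a sketch: the function name procs_for() is made up, and the Shared and Diff values are copied from the "after the second request" results above.

```perl
use strict;
use warnings;

my $ram = 256 * 1024 * 1024;    # 268435456 bytes

# [label, shared, diff], copied from the second-request table.
my @versions = (
    [ 'install_driver (2)',                   2609152,  860160 ],
    [ 'install_driver & connect_on_init (5)', 2605056,  876544 ],
    [ 'preload driver (3)',                   2588672,  880640 ],
    [ 'nothing added (1)',                    2482176,  995328 ],
    [ 'connect_on_init (4)',                  2469888, 1011712 ],
);

# N = (RAM - largest_shared_size) / Diff, rounded down
sub procs_for {
    my ($ram, $shared, $diff) = @_;
    return int( ($ram - $shared) / $diff );
}

my %procs = map { $_->[0] => procs_for($ram, $_->[1], $_->[2]) } @versions;

# print the versions, best one first
for my $label (sort { $procs{$b} <=> $procs{$a} } keys %procs) {
    printf "%-38s %4d\n", $label, $procs{$label};
}
```

For versions (2) and (4) this gives 309 and 262, the numbers worked out above. The other three versions fall in between.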
CGI.pm is a big module that by default postpones the compilation of
its methods until they are actually needed, thus making it possible to
use it under a slow mod_cgi handler without adding a big
overhead. That's not what we want under mod_perl and if you use
CGI.pm you should precompile the methods that you are going to use
at the server startup in addition to preloading the module. Use the
compile method for that:
use CGI;
CGI->compile(':all');
where you should replace the tag group :all with the real tags and
group tags that you are going to use if you want to optimize the
memory usage.
We are going to compare the shared memory footprint by using a
script which is backward compatible with mod_cgi. You will see that you
can improve the performance of this kind of script as well, but if you
really want fast code think about porting it to use
Apache::Request for the CGI interface and some other module for HTML
generation.
So here is the Apache::Registry script that we are going to use to
make the comparison:
preload_cgi_pm.pl
-----------------
use strict;
use CGI ();
use GTop ();
my $q = new CGI;
print $q->header('text/plain');
print join "\n", map {"$_ => ".$q->param($_) } $q->param;
print "\n";
my $proc_mem = GTop->new->proc_mem($$);
my $size = $proc_mem->size;
my $share = $proc_mem->share;
my $diff = $size - $share;
printf "%8s %8s %8s\n", qw(Size Shared Diff);
printf "%8d %8d %8d (bytes)\n",$size,$share,$diff;
The script initializes the CGI object, sends the HTTP header and then
prints all the arguments and values that were passed to the script, if
any. At the end as usual we print the memory usage.
As usual we are going to use a single child process, therefore we will use this setting in httpd.conf:
MinSpareServers 1
MaxSpareServers 1
StartServers 1
MaxClients 1
MaxRequestsPerChild 100
We are going to run memory benchmarks on three different versions of the startup.pl file. We always preload this module:
use GTop ();
1. Leave the file unmodified.
2. Preload CGI.pm:
use CGI ();
3. Preload CGI.pm and pre-compile the methods that we are going to use
in the script:
use CGI ();
CGI->compile(qw(header param));
The server was restarted before each new test.
So here are the results of the three tests that were conducted, sorted by the Diff column:
After the first request:
Version Size Shared Diff Test type
--------------------------------------------------------------------
1 3321856 2146304 1175552 not preloaded
2 3321856 2326528 995328 preloaded
3 3244032 2465792 778240 preloaded & methods+compiled
After the second request (all the subsequent requests showed the same results):
Version Size Shared Diff Test type
--------------------------------------------------------------------
1 3325952 2134016 1191936 not preloaded
2 3325952 2314240 1011712 preloaded
3 3248128 2445312 802816 preloaded & methods+compiled
The first version shows the results of the script execution when
CGI.pm wasn't preloaded, the second version with the module
preloaded, and the third when it's both preloaded and the methods that are
going to be used are precompiled at the server startup.
By comparing versions one and two of the second table we can conclude
that preloading adds about 176K of shared size (2314240-2134016). As we
have mentioned at the beginning of this section, CGI.pm defers the
compilation of its methods to reduce the load overhead, which means that
preloading the module alone captures only part of the possible savings.
If we compare the second and the third versions we will see a further
significant difference of 204K (1011712-802816), and we have used only
a few methods (the header method loads a few more methods transparently
for the user). Imagine how much memory we are going to save if we
precompile all the methods that we are using in other scripts that use
CGI.pm and do a little bit more than the script that we have used in
the test.
But even in our very simple case using the same formula, what do we see? (assuming that we have 256MB dedicated for mod_perl)
             RAM - largest_shared_size
N_of Procs = -------------------------
                       Diff

             268435456 - 2134016
(ver 1)  N = ------------------- = 223
                   1191936

             268435456 - 2445312
(ver 3)  N = ------------------- = 331
                   802816
If we preload CGI.pm and precompile a few methods that we use in
the test script, we can have almost 50% more child processes than when we
don't preload and precompile the methods that we are going to use.
META: I've heard that the 3.x generation will be less bloated, so probably I'll have to rerun this using the new version.
mergemem is an experimental utility for Linux, which looks very
interesting for us mod_perl users:
http://www.complang.tuwien.ac.at/ulrich/mergemem/
It looks like it could be run periodically on your server to find and merge duplicate pages. It won't halt your httpds during the merge, this aspect has been taken into consideration already during the design of mergemem: Merging is not performed with one big systemcall. Instead most operation is in userspace, making a lot of small systemcalls.
Therefore blocking of the system should not happen. And, if it really should turn out to take too much time you can reduce the priority of the process.
The worst case that can happen is this: mergemem merges two pages
and immediately afterwards they will be split. The split costs about
the same as the time consumed by merging.