Ken McDonell <kenj@kenj.id.au>
September 2018
In this document the generic term "component" will be used to describe a piece of functionality in Performance Co-Pilot (PCP) that could be any of the following:
Obviously the functional goals and feature focus of PCP have been to build tools that help understand hard performance problems seen in some of the world's most complex systems. Over more than two decades, the delivery of components to address the feature requirements has been built on engineering processes and approaches with a technical focus on:
From the outset we were concerned with making PCP work successfully across a wide variety of environments.
The earliest PCP architectural deliberations began at Pyramid Technology in 1993, but by this time I had already been working on portable software for more than 15 years. Then the PCP incubator moved to SGI, and in the very early days of Linux, Mark Goodwin undertook an "IRIX to Linux" port as a skunk works project within SGI Engineering. A little later, SGI's Clustered File System (CXFS) had PCP components for IRIX, Windows, HPUX, AIX, MacOS and Linux, across a variety of CPU architectures, C compilers and operating systems.
Although many of these have passed away, PCP survives and is now actively supported on all Linux variants (most importantly the Fedora/RPM, OpenSuSE/RPM and Debian/dpkg platforms and their respective derivatives), the BSD family, OpenIndiana (nee Sun Solaris), Mac OS X and Windows.
So there is a very rich history of engineering for portability in the PCP DNA, and an expectation that new PCP components will be "robust" (see below for a definition of "robust") across all of the supported platforms and environments.
The PCP project has never had a dedicated QA team, so by necessity the QA model that was adopted (and at times enforced) was one where if an engineer added a feature or fixed a bug, there was an expectation that there would be additional QA coverage for the associated changes. Although this approach suffers from tunnel-vision and common assumptions between development and testing, it does have the advantage that testing and QA is a communal responsibility.
This effort grew into the very large QA infrastructure that lives below the qa directory in the source tree and is shipped in the pcp-testsuite package. This represents a significant engineering effort in its own right as the table below shows:
| | below the src dir | below the qa dir |
|---|---|---|
| C or C++ | 320,000 | 37,000 |
| shell | 23,000 | 85,000 |
| perl | 18,000 | 700 |
| python | 17,000 | 200 |
The QA test suite comprises close to 1,100 test scripts. The scripts are used by individual developers to test code changes and check for regressions. In addition the entire suite of scripts is run regularly across the 30+ machines in the Melbourne QA Farm.
Robustness simply means that every PCP application and service either works correctly, or detects that the environment will not support correct execution and is then either omitted from the build, omitted from the packaging, or reports a warning and exits in an orderly fashion.
Some examples may help illustrate this philosophy.
Mostly what's been done here is common and good engineering practice. For example, we use configure, conditionals in the GNUmakefiles and assorted sed/awk rewriting scripts to ensure the code will compile cleanly on all platforms. Compiler warnings are enabled and builds are expected to complete without warnings. And in the source code we demand thorough error checking and error handling on all system and library calls.
We've extended the normal concept of macros to include a set of globally defined names that are used for path and directory names, search paths, application names, and application options and arguments. These are defined in src/include/pcp.conf.in, bound to values at build time by configure and installed in /etc/pcp.conf. These can then be used in shell scripts, and applications in C, C++, Perl and Python have run-time access to them via pmGetConfig() or pmGetOptionalConfig(); see for example src/pmie/src/pmie.c.
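As a small shell-side sketch (the final ls is purely illustrative), a script can source pcp.env to pick up these names and then use the symbolic values rather than hard-coded paths:

```sh
# pull the pcp.conf variables into the environment
# ($PCP_DIR is typically unset, so this really reads /etc/pcp.env)
. $PCP_DIR/etc/pcp.env
# use the symbolic name rather than a hard-coded path like /var/lib/pcp/pmdas
ls $PCP_PMDAS_DIR
```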
Even file pathname separators (/ for the sane world, \ elsewhere) have been abstracted out and pmPathSeparator() is used to construct pathnames from directory components.
At a higher level we don't even try to build code if it is not intended for the target platform.
At packaging time we use conditionals to include only those components that can be built and are expected to work for the target platform.
This extends to wrapping some of the prerequisites in conditionals if the prerequisite piece may not be present or may have a different name.
For Debian packaging this means debian/control is built from debian/control.master and the ?{foo} elements of the Build-Depends and Depends clauses are conditionally expanded by the debian/fixcontrol.master script during the build.
For RPM packaging this means using configure to perform macro substitution to create build/rpm/pcp.spec from build/rpm/pcp.spec.in and using %if within the spec file to selectively include packages and adjust the BuildRequires and Requires clauses.
There are purpose-designed QA applications in the qa/src directory and the source code for these applications should follow all of the same guidelines for portability and build resilience as those that apply for the applications and libraries that ship with the main part of PCP. The only possible exception is that error handling can be a little more relaxed for the QA applications as they are used in a more controlled manner.
All QA test scripts will have the variables set in /etc/pcp.conf placed in the environment using $PCP_DIR/etc/pcp.env which is called from common.rc (with $PCP_DIR typically unset). And all QA tests source common.rc, although the sequence is a little convoluted (and irrelevant for this discussion).
The important thing is that the following environment variables become available to improve the portability of QA test scripts and the output filtering (see below) they must perform.
For example, use
$ cd $PCP_PMDAS_DIR/mypmda
rather than the hard-coded
$ cd /var/lib/pcp/pmdas/mypmda
Similarly, neither
$ echo -n "Prompt? "
nor
$ echo "Prompt? \c"
is portable, so instead use
$ $PCP_ECHO_PROG $PCP_ECHO_N "Prompt? $PCP_ECHO_C"
PCP QA scripts follow a standard template when created with the qa/new script (which is the recommended way to create new QA test scripts), and a number of local shell variables are then available. For example, $tmp provides a unique prefix for temporary file names, as in
$ ... >$tmp.out 2>$tmp.err
or $tmp may itself be used as a temporary directory:
$ mkdir $tmp
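A minimal sketch of the cleanup that normally accompanies $tmp is shown below; the exact trap line generated by qa/new may differ, so treat this as illustrative only:

```sh
# make sure all the $tmp.* files are removed on any exit path
status=1        # failure is the default
trap "rm -rf $tmp $tmp.*; exit \$status" 0 1 2 3 15
```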
There is a very large set of applications and packages outside of PCP that are required to build PCP from source and/or run PCP QA. The script qa/admin/check-vm tries to capture all of these requirements. check-vm should be used as follows:
Our QA test scripts are run with sh, not bash, and on some platforms these are really different!
Things like an == operator used with test (aka [) as in
$ if [ "$foo" == "bar" ] ...
will not work with the Bourne shell.
Instead, use the = operator, e.g.
$ sh
$ x=1
$ [ "$x" = 1 ] && echo yippee
yippee
$ [ "$x" == 1 ] && echo boo
sh: 3: [: 1: unexpected operator
Also, any use of the bash [[ operator
$ ... [[ ... ]]
is going to blow up when presented to a real Bourne shell.
In most cases test (aka [) can be used for a straightforward rewrite.
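For instance, a bash-only test like the first one below (the file names are purely illustrative) has a direct equivalent using test:

```sh
# bash-only: blows up under a real Bourne shell
if [[ -f $tmp.out && -s $tmp.out ]]
then
    echo "have output"
fi

# portable rewrite using test (aka [)
if [ -f $tmp.out -a -s $tmp.out ]
then
    echo "have output"
fi
```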
Even less portable is any use of the bash $((...))
construct for in-built arithmetic.
For these cases, you'll need equivalent logic using expr, e.g.
instead of
x=$(($x+1))
use
x=`expr $x + 1`
Another recurring one is the -i argument for sed;
this is not standard and not supported everywhere so just do not use
it. The alternative:
$ sed ... <somefile >$tmp.tmp; mv $tmp.tmp somefile
works everywhere for most of the cases in QA.
If cross-filesystem linking and a lame mv are in play then the
following is even more portable:
$ sed ... <somefile >$tmp.tmp; cp $tmp.tmp somefile; rm $tmp.tmp
If permissions are in play, then you may need:
$ sed ... <somefile >$tmp.tmp; $sudo cp $tmp.tmp somefile; rm $tmp.tmp
Do not use seq as it is not portable.
For example, this is not portable:
for i in $(seq 4); do ...; done
but it can be rewritten as follows and this will work everywhere:
i=1; while [ $i -le 4 ]; do ...; i=`expr $i + 1`; done
Each QA test script must produce a standard output stream that is deterministic. The qa/check script that is responsible for running each QA test script (say NNNN) captures the standard output from NNNN; if this matches the expected result in NNNN.out then the test is deemed to have passed, otherwise the standard output from NNNN is saved in NNNN.out.bad and the test is deemed to have failed.
So it is critical that the output of NNNN is deterministic across all platforms and timezones.
Applications used in a QA test script will potentially produce output that is irrelevant to the success or otherwise of the test. The most obvious example is the current date from an application run on the test system. Consequently, most QA test scripts include one or more _filter() functions that are responsible for translating the raw output from the applications run in the test into the deterministic output.
These filtering functions in turn often use a set of standard functions in qa/common.filter that may be applied to the output of common PCP commands and operations.
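As an illustration (not taken from any particular test), a local _filter() might map the obvious sources of non-determinism to fixed strings, and the test then pipes the raw output of each command through it:

```sh
# map the local hostname and any times of day to constant strings so the
# output does not change from host to host or from run to run
_filter()
{
    sed \
        -e "s/`hostname`/HOSTNAME/g" \
        -e 's/[0-9][0-9]:[0-9][0-9]:[0-9][0-9]/TIME/g'
}
```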
There are way too many filtering cases to describe them all here, but the list below is illustrative of the range of techniques that may be needed. And each QA test script may need to employ more than one of the techniques to produce deterministic output for "correct" execution across all platforms.
To the extent that filtering removes or rewrites what would otherwise be the output of a QA test script, there may be some difficulty in triaging test failures if you have only the expected NNNN.out and the observed NNNN.out.bad files. Most QA test scripts will emit unfiltered output and diagnostics to aid triage and these are written to a NNNN.full file which is retained for inspection after the test has been run.
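In outline the idea is as follows (assuming the template's $seq and $here variables, an illustrative _filter function and an arbitrary choice of pmprobe as the application under test):

```sh
# keep the raw, unfiltered output for post-mortem triage ...
pmprobe -v sample.colour >$tmp.raw 2>&1
cat $tmp.raw >>$here/$seq.full
# ... but emit only the filtered, deterministic version on standard output
_filter <$tmp.raw
```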
If QA test script NNNN decides it cannot or should not be run on the current platform, it should create a NNNN.notrun file. Optionally NNNN.notrun contains the reason the test has not been run.
The convenience function _notrun() defined in qa/common.check may be used to create the file, write the reason to the file and exit.
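For example, a test that depends on an optional tool might begin with a check like this (the tool named here is just illustrative):

```sh
# skip this test cleanly if valgrind is not installed on this platform
which valgrind >/dev/null 2>&1 || _notrun "valgrind is not installed"
```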
Some common reasons for a test to be "not run" are:
If a QA test script uses an application that traverses the PMNS for a non-leaf metric name (usually from the command line or a configuration file) then depending on the PMDA(s) involved, there is a chance that the names in the PMNS may be processed in different orders on different platforms.
The simplest solution is to enumerate the PMNS nodes of interest in the QA script using pminfo, sort the list and then have the application being tested operate on the leaf nodes of the PMNS one at a time. qa/647 illustrates an example of this approach.
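A sketch of the idea (using the sample subtree as an illustrative non-leaf name, and pminfo -f as a stand-in for the application under test):

```sh
# enumerate the leaf metrics below a non-leaf name in a fixed order,
# then exercise the application one leaf metric at a time
pminfo sample | LC_COLLATE=POSIX sort \
| while read metric
do
    pminfo -f $metric
done
```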
Some PMDAs maintain their instance domains in a manner that may present instances in non-deterministic order, so while all instances may be present on all platforms, the sequence of the instances within an instance domain or in a pmResult is not the same everywhere.
The QA helper application $here/src/sortinst (source in qa/src/sortinst.c) may be used to re-order the reported instances so the sequence is deterministic. See qa/1180 for an example that uses $here/src/sortinst to re-order pminfo -f output.
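In outline this looks something like the following (the metric name is illustrative):

```sh
# sort the "inst [...]" lines for each metric into a deterministic order
pminfo -f kernel.percpu.cpu.user | $here/src/sortinst
```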
Unfortunately sort does not produce the same sorted order on all platforms by default. We need to take control of LC_COLLATE and the decision has been made to standardize on POSIX as the collating sequence for PCP QA.
$LC_COLLATE is set and exported into the environment in common.rc, but this is a relatively recent change so LC_COLLATE=POSIX is liberally sprinkled throughout the QA test suite ahead of running sort.
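For example ($tmp.list is just an illustrative input file):

```sh
# force the POSIX collating sequence for this particular sort
LC_COLLATE=POSIX sort $tmp.list
```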
Some PMDAs use readdir() or similar routines to scan a directory's contents to find metrics and/or instances. This practice is certain to expose platform differences as the order of directory entries is unlikely to be deterministic.
Judicious use of sort will be required. See qa/496 for a typical example of how this should be done.
For some PMDAs and/or some QA test scripts, no amount of clever engineering will hide the fact that a "correct" test execution on one platform may produce different output to "correct" test execution on another platform. In these cases the simplest choice is to not have a single NNNN.out file, but rather have a set of them and the QA test script chooses the most appropriate one and links this to NNNN.out when it starts.
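In outline the start of such a test looks something like this ($PCP_PLATFORM comes from /etc/pcp.conf; the .out.linux and .out.other names are illustrative):

```sh
# pick the expected output that matches this platform before the test proper
rm -f $seq.out
case $PCP_PLATFORM in
    linux)
        ln $seq.out.linux $seq.out
        ;;
    *)
        ln $seq.out.other $seq.out
        ;;
esac
```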
Alternate NNNN.out files may be required in the following types of cases: