perl for newbies
================

Introduction
------------

The idea of this is to explain the basics of programming in perl. I'll try to be general and focus on the ideas instead of just the details, and from time to time compare them to how things are done in shell script or in C. The end goal is more to be able to read, understand and modify perl scripts, and eventually write them on your own.

No programming skills are assumed, but I won't go explaining what a file is or what a pipe is - a basic understanding of Unix is quite necessary before one tries to program for it.

There is a lot of stuff in here, and it's quite condensed; it's probably a good idea to read through it without trying to remember anything but the basics at first, to have a general idea of what one can do, and then look more in detail at the examples to understand precisely how they work.

Also, this is nowhere near complete; it's a selected subset of perl. perl has many more commands than are explained here, and quite a few commands have more options or possibilities than I intend to present. There are probably also inaccuracies in the detailed semantics of some operations. In any case, the definitive reference is the man pages.

1. What is perl?
----------------

First of all, a bit of philosophy. Perl is often called a scripting language, as opposed to a programming language, because it was not originally intended for large programming works, but for automating quick tasks effectively. Perl has grown more and more powerful over the years, and has been used for large projects, but it is quite obvious that it is not built for that.

For one, perl never properly compiles your code into native machine-code, like C compilers do. Every time you execute a perl script, it reads it into memory, semi-compiles it into some internal tokenized form, and then interprets that.
This works quite well, and is much faster than interpreting it line by line decoding the words every time (like shells do with shell scripts), but still considerably slower than C.

There are two major 'flavors' of perl, perl 4 and perl 5, its successor. perl 5 is mostly backwards-compatible with perl 4, with only one noticeable exception, and also has many additional features and concepts. I will only deal with perl 4, except where perl 5 isn't compatible with it.

2. How to call it
-----------------

Since each perl script needs to be interpreted by the 'perl' program itself, and perl scripts are nothing else than text files, they need to start with the characters #! followed by the complete path to the perl binary on the system, typically:

    #!/usr/local/bin/perl

or

    #!/usr/bin/perl

This, once the file is executable, makes it so that when you execute it the kernel will read that line and fetch the perl interpreter and feed it the script transparently.

The alternatives, to start up a perl script, are either to explicitly call perl with the name of the script as an argument, like:

    perl blah.pl

or to give the script entirely in the command-line with the -e switch, such as:

    perl -e 'chmod 0755, "blah"'

perl recognizes quite a few possible options; the most useful, other than -e to introduce the script at the command-line, are:

    -l --- do automatic line-feed translation, adding a line feed at the end of every 'print', and chopping line feeds at the end of every read
    -v --- prints the version of the perl interpreter and exits
    -w --- prints warnings about the script, tracking dubious situations like variables that are being used without having been assigned a value. Very useful to find out what's wrong with a script

All of these are documented in "man perl" in a perl 4 installation, and in "man perlrun" in a perl 5 installation (perl 5 splits the huge manpage into ~25 different sections).
Inside a perl script, the character # introduces a comment, so the rest of the line is ignored.

3. Variables and data types
---------------------------

Within a perl script, temporary data is held in variables; those behave pretty much like environment variables (which are the only kind of variables available in shell scripts), but are not the same.

The basic type of a variable, in perl, is called the _scalar_. A scalar is a string or a number, and you never need to worry about converting between one and the other (as you do in C). You can read a value from the keyboard (which will naturally be a string), and then perform arithmetic operations on it, and it will work assuming the user entered a valid number. In C, as with most compiled languages, you would have to explicitly convert the string to a number. Also, in perl all scalars can be integers or floating-point numbers, and perl takes care of using the integer representation if possible, for speed.

Each variable has a name, which always starts with a special character to indicate the type, and then a name made of at least one letter, and then letters, numbers and/or underscores. All variable names are case sensitive (and lower case is normally used). The special character for scalar variables is $ and is used both when assigning to the variable and when using its value.

ex:
    $blah = 3;                   # assigns 3 to the scalar variable $blah
    $blah = "3";                 # assigns the string "3", which is equivalent
    $urf = $blah + 1;            # now $urf is 4
    $blah = "hello world!\n";    # the \n is a line feed
    print $blah;                 # prints "hello world!"

In shell, a $ is used only when fetching the value, but not when assigning:

ex:
    blah=3
    echo $blah

In the script, strings are entered either between "" or between '', with the difference that between "" variables will be interpolated, and \'s will be taken to have a special meaning, introducing special characters. Within ''s, everything is taken literally.
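The automatic conversion between strings and numbers described above, and the difference between the two kinds of quotes, can be seen side by side in a small sketch. The variable names are made up for illustration, and the "typed" value is simply assigned rather than really read from the keyboard.

```perl
$typed  = "12";                 # pretend this string was typed by the user
$total  = $typed + 30;          # used as a number: $total is now 42

$double = "total: $total\n";    # ""-string: $total and \n are interpreted
$single = 'total: $total\n';    # ''-string: everything is taken literally

print $double;                  # prints "total: 42" followed by a line feed
print $single, "\n";            # prints the characters: total: $total\n
```

Note that the ''-string keeps the two characters \ and n as they are, which is why a separate "\n" is needed to end the second line.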
Whenever there might be an ambiguity as to how long the name of a variable is, $variable can be written as ${variable}.

ex:
    $blah = "eeek!";
    $blahblah = "urf";
    print "$blahblah\n";         # prints "urf"
    print "${blah}blah\n";       # prints "eeek!blah"

Strings can contain newline characters simply by having the opening and closing quote on different lines, whether they are ""-strings or ''-strings.

ex:
    $blah = "eek!";
    $message = "$blah you scared me";    # now contains "eek! you scared me"
    $bogus = '$blah you scared me';      # now contains "$blah you scared me"

In perl, as in shell (and unlike in C), variables need not be declared in any way before they are used; the first use of a variable creates it. Reading a non-existing value returns an empty string, which is interpreted as 0 in a numerical context.

Variables whose name (after the $) starts with a special character that is not a letter are 'reserved' for perl, and a list of them is given in the manpages (man perlvar in a perl 5 installation). Some of them are meant to be assigned to, and will change some specific behavior of perl according to their value, while others are meant to be read and will hold results from previous operations.

The only "special" scalar variable that is really useful is $_, which is left free for the user to keep temporary data in, and has the property that many functions will operate on $_ if no argument is given.

ex:
    $_ = "blah blah blah french fries\n";
    print;                       # will print the contents of $_

Note that not *all* functions use $_ by default, so you can't assume it. But it works for most of the functions where it would actually make sense, so in case of doubt, it's worth testing it with a 1-liner to see if it actually works, or looking here or in the man page for the corresponding function.

Another important type of data is the array; an array is a numbered (and possibly empty) list of scalars. The special character @ identifies arrays, in the same way $ identifies scalars.
All arrays are numbered from 0: the first element has the number 0, the second has the number 1, and so on. An array can be explicitly entered in the form of a comma-separated list of values between parentheses. The n-th element of the array @array is referenced as $array[n]. Note the $ instead of @, because we're referencing just one scalar, rather than the whole array.

    $plef = 1;                               # sets a variable
    @thearray = ("blah", 34, "urf", $plef);  # sets the array; note
                                             # that one of the elements
                                             # is set to the value of
                                             # $plef
    $second = $thearray[1];                  # takes the second element
                                             # (numbered 1) of @thearray
                                             # and copies its value into
                                             # $second, which is now 34.
    @eek = @thearray[0, 2];      # takes a list of elements from @thearray
                                 # and assigns them to the array @eek; here
                                 # we're taking the 1st and the 3rd element
                                 # (numbered 0 and 2)
                                 # now @eek is ("blah", "urf")
    @urka = (@thearray, @eek);   # this is *not* an array of arrays, as
                                 # arrays are always arrays of scalars.
                                 # what this does is set @urka to
                                 # contain all the elements of @thearray
                                 # and then all those of @eek, so @urka
                                 # is now
                                 # ("blah", 34, "urf", 1, "blah", "urf")

There are functions that operate on arrays, adding elements to either end, or taking them off, or taking those elements that satisfy a condition, or sorting them...

The index of the last element (i.e. the number of elements minus 1) of an array called @array is given by $#array. For empty (non-existing) arrays, this will evaluate to -1.

ex:
    @array = ("blah", "blah", "blah", "french", "fries");
    print $#array, "\n";         # will print 4

(Note that this specific occurrence of # in $#array does not introduce a comment; perl is smart enough to see it's actually being used in the language itself.)

We've seen that inside a ""-string, variables are interpolated, so that "blah$eek" actually means "blahblah" if $eek is "blah". This applies also to @arrays, which means that inside ""-strings, the character '@' will be considered special.
You usually don't want to interpolate @arrays inside strings by accident (all their elements come out pasted together, separated only by single spaces by default), so in practice you want to put a \ before any literal @ in a ""-string.

ex:
    print "my email address is root@home\n";     # <- does *NOT* work
    print "my email address is root\@home\n";    # <- works

Alternatively, you can use ''-strings, in which @ is not interpreted in a special way, but neither is \n so if you want to include a newline they're not any better.

    print 'e@a', "\n";                           # <- prints "e@a"

## you can skip this bit at first

The third (and last, as far as we're concerned) type of data in perl 4 is the _hash_ (also called associative array). These are denoted by a % and are able to hold scalars indexed by other scalars. The best way to think of them is like dictionaries: each associative array holds values (the definitions in a dictionary), assigned to identifiers (the words), and the basic operations are adding or removing words or changing their value, and checking whether a word exists in the dictionary or not, and what its value is. This is a very powerful data structure, but not really complicated to use.

The way to reference a whole hash is %hashname, and the way to reference the element indexed by the word "word" is $hashname{"word"}. Hashes can be assigned direct values the same way arrays are, with a list of values between ()'s, and the scalars will be taken to be alternately the index and the value.
ex:
    %blah = ("eek", 34, "urf", 18, "zer0", "n0thing");
                                 # note that in this previous line,
                                 # all these strings could have been
                                 # written between ''s instead of ""s
    print $blah{"eek"}, "\n";    # prints 34 (followed by a newline)
    $s = 'zer0';
    @onx = ($blah{"urf"}, $blah{$s});
                                 # now @onx is (18, "n0thing")
                                 # note that the value inside the {}'s
                                 # can be a scalar variable itself,
                                 # in this case $s, so $blah{$s}
                                 # evaluates to $blah{'zer0'} and
                                 # then to "n0thing"

The three types of variables (scalars, arrays, hashes) have separate namespaces, which means that you can have a scalar called $blah and an array called @blah, and they won't interact with each other in any way just because of the similarity in names.

ex:
    $blah = 1234;
    $blah{"onx"} = "blurf";      # one element of the hash %blah, hence the $
    @blah = (18, "eeeek!", 1.5);
    print $blah, " ", $blah[1], $blah{'onx'};
                                 # prints "1234 eeeek!blurf"

## end of skippable section

4. Some useful builtin functions
--------------------------------

Builtin functions are all of perl's "commands", which take arguments, do something to them, and/or to the system, and return values. Many functions don't require that you put their arguments between ()'s, but in complicated expressions ()'s will often be necessary to make it clear what applies to what and which bits should be evaluated together. So typically we avoid ()'s in simple lines like print $blah;, but include them when there is more than one function being called in the same statement.

Perl statements are always terminated by a ;, by the way.

Functions:

* chmod -- Takes a number and a list of file names, and sets the permissions on all the files to those specified by the number. You'll usually want to specify the number in octal, which is the same representation numeric chmods use; you do that in perl by prepending a 0 to the number. Returns a "true" value (anything other than 0 or '') if it succeeded, and a "false" value (0 or '') if not.
ex:
    chmod 0644, "README", "THIS.SITE.SUCKS";

* chop -- Takes a scalar as its argument, and chops the last character off it, returning that character as the value of the "chop". Most often used to take the \n from the end of lines read from files or from the keyboard.

ex:
    $a = "takes a scalar";
    $b = chop($a);               # now $a is "takes a scala" and
                                 # $b is "r"

In most cases we don't want the returned value, so we just do

    chop $a;

or even just

    chop;

which operates on $_.

* die -- Takes a string and exits immediately with that string as an error message.

ex:
    if ($blah != 18) {
        die "it wasn't 18\n";
    }

* exit -- Takes a number, and exits with that number as the return code; you typically do

    exit 0;

at any point the program is successfully done, and things like

    exit 1;                      # or other non-0 values

when you want to exit "unsuccessfully" but still 'normally' (die is more used for "this should never happen" situations, while exit will just exit; it's quite a bit a matter of convention)

* join -- Takes a string and then a list of values or an array or more arrays, and produces a string obtained by pasting all the values together, using the first string as a separator.

ex:
    $bah = join(":", "eeek", "nothing", 34);
                                 # $bah is now "eeek:nothing:34"
    @array = ("urf", "squeak", "foo");
    print (join(" ", @array, "eeek"), "\n");
                                 # prints "urf squeak foo eeek"
                                 # note that the parentheses are needed
                                 # to tell perl that the , before "\n"
                                 # separates the 2 arguments to print,
                                 # rather than "\n" being yet another
                                 # argument to 'join'.

* keys -- Takes an associative array by name, and returns a list of all the 'keys' in it. The order in which the keys come out is not something you can rely on.

ex:
    $blah{'eek'} = "*pat*";
    $blah{'urf'} = "erp";
    @thekeys = keys(%blah);      # @thekeys is ("eek", "urf"), in some order
    print (join(" ", keys(%blah)), "\n");
                                 # prints "eek urf" (or "urf eek")

* length -- Takes a string and returns its length in characters.
ex:
    $n = "blip";
    $l = length($n);             # $l is now 4
    print (length(34), "\n");    # prints 2, the length of the string
                                 # representation of the number 34

* mkdir -- Makes a directory, with the first argument specifying the directory name, and the second the starting permissions on it.

ex:
    mkdir "files", 0755;

or better (more checking)

    if (!mkdir("files", 0755)) {
        die "failed to create directory\n";
    }

* pop -- Takes an array, and returns the last element of the array, removing it from the array. If the array is empty, returns '' (actually, undef).

* print -- Takes a list of arguments, and prints them one after another to the script's standard output. If the first argument to print is *not* followed by a ',', it is assumed to be the name of a "file descriptor" on which the expression should be printed rather than using stdout. More on file descriptors later. If you're just printing a list of scalars, then you typically don't need ()'s and it's much more readable that way. If you include any calls to functions (such as 'join') as arguments to print, it is a good idea to fully parenthesize the expression to make it clear.

ex:
    print "Hello, world\n";
    $blah = 3;
    print "The value of 'blah' is $blah\n";
                                 # prints "The value of 'blah' is 3"
    print ("the keys are:", join(" ", keys %somehash), "\n");
    print STDERR "Error reading from bit bucket - brush your teeth\n";
                                 # sends this message to the program's standard error
                                 # note no comma after STDERR

* push -- Takes the name of an array, and then scalars and/or arrays, and adds them to the end of the first array.

ex:
    @blah = ("a");
    $onx = "xterm";
    push(@blah, "eeek", $onx, "nothing", 12.5);
                                 # @blah is now ("a", "eeek", "xterm", "nothing", 12.5)

Most often you just use it to push one value to an array.

* rmdir -- Takes the name of a directory and removes it. Returns a "true" or "false" value as usual.

ex:
    rmdir "temp_dir";

* shift -- Takes the name of an array, and returns the first element of it, removing it from the array.
It is, in a way, the opposite of push: you add things to an array with push() and then remove them all with shift(), and you're treating them in the same order they came. You can also see it as the opposite of unshift(); if you add things with unshift() and take them out with shift(), you always deal with the most recent first. If the array is empty, leaves it empty and returns a "false" value (more precisely, an 'undef' value).

ex:
    @blah = ("a", "xterm", 12.5);
    $v = shift @blah;            # @blah is now ("xterm", 12.5) and $v is "a"

* sleep -- Takes a number of seconds and waits those seconds.

* split -- Takes a pattern and a string, and splits it into an array; more on this later, after the regular expressions.

* undef -- Takes the name of a variable (of any type) and undefines it. Without arguments, just returns an undefined value. 'undef' is a special value in perl, which is neither '' nor 0 but compares as equal to both (so it's considered 'false' in tests), but can also be checked on its own.

ex:
    undef %blah;                 # gets rid of the %blah hash
    undef $eek;                  # gets rid of the $eek scalar variable
    $eek = undef;                # same

* unlink -- Takes a list of file names (typically just one) and deletes them. Returns a true/false value as usual.

ex:
    unlink "tempfile";
    unlink "tempfile", "otherfile";

* unshift -- Takes the name of an array, and then scalars and/or arrays, and adds them to the *beginning* of the first array.

ex:
    @blah = ("a", "b");
    $onx = "xterm";
    unshift(@blah, "eeek", $onx, "nothing", 12.5);
                                 # @blah is now ("eeek", "xterm", "nothing", 12.5, "a", "b")

* <> -- Reads a line from the stdin and returns it.

ex:
    $_ = <>;                     # read a line in $_
    chop;                        # chop the \n at the end
    print "You typed: '$_'\n";   # re-print it

Actually this command is much more general, and can be used to read from files after opening them, from network connections even...

5. Subroutines
--------------

Up to now all the "programs" we'd be able to make are very linear, i.e. they'd be executed from the beginning to the end, once each line, without tests or loops, and all in one big block.

It is, however, possible to define our own functions (or procedures, or subroutines, or whatever we want to call them) in perl, and then call them just like we'd call perl's builtin functions. The way to define a function is this:

    sub nameofthefunction {
        contents;
        of;
        the;
        function;
    }

and then the way to call it is

    &nameofthefunction;

to call it without arguments, and

    &nameofthefunction($arguments, @go, $here);

to call it with arguments. The '&' is the special character that identifies subs, in the same way '$' identifies scalars and @ arrays; once again, the namespaces are separate so you can have a variable with the same name as a sub.

subs can be defined anywhere in the program, except (for obvious reasons of clarity) inside other subs.

[ this is very perl 4-oriented; in perl 5 you can have subs inside each other, subs without names, and you don't necessarily need the special & sign to call a sub, but we said we'd stick to perl 4... ]

subs can also be used before they are defined, since what perl does is first go through the whole script, compile it in memory, setting the subs aside, and then go through the whole thing again, this time executing all the lines that are not inside a sub definition.

ex:
    #!/usr/local/bin/perl
    $blah = 3;
    &showblah;
    $blah = 4;
    sub showblah {
        print "blah is $blah\n";
    }
    &showblah;

will first set $blah to 3, then call the sub, then set $blah to 4, then call it again; the place where it is defined makes no difference.

In practice we usually want to keep the subs at the beginning of the script, typically interspersed with any initialization the variables they use require, and then the bulk of the "main" code at the end, after all the sub definitions.
But perl never forces us to be logical or clear :)

Inside a sub, we can access the parameters passed to it (if any) in the special array @_. So the first parameter passed will be $_[0], and the second $_[1], and so on. And as usual, this does not clobber the scalar variable $_.

ex:
    sub printmyfirstargument {
        print $_[0], "\n";
    }

    sub printmynumberofarguments {
        print $#_ + 1, "\n";     # this sure is ugly; the + 1 is necessary
                                 # because this $# thing gives the number of
                                 # arguments - 1
    }

    sub addone {
        $_[0] = $_[0] + 1;       # to show how functions can modify their
                                 # arguments
    }

    $blah = 3;
    &addone($blah);              # blah is now 4

Also, each function returns a value, which is the last expression that was evaluated in it. From a sub, you can return at some point with the "return" command, which will then return the value you give it:

    sub one {
        if ($_[0] == 1) {        # test if the first argument is 1
            return "one";        # if so, return the string "one"
        } else {                 # otherwise...
            return "not one";    # return the string "not one"
        }                        # close the "if/else"
    }

    print &one(1), ", ", &one(3), "\n";    # prints "one, not one"

6. Operators
------------

We've seen that perl uses the usual operators in the usual way on numbers: + for addition, - for subtraction, * for multiplication, / for division. Also, == tests numerical equality, and returns a "true" or "false" value depending on whether the two numbers were equal or not. The opposite test, for numerical difference, is !=. As expected, < tests that the first number is smaller than the second, > that the first is bigger, <= that the first is smaller or equal, and >= that the first is bigger or equal.

Using ==, !=, <, >, <=, >= on strings that are not valid representations of numbers is one of the many things that generate a warning with the option -w.

There are quite a bunch of other operators; some of them operate on strings:

    .     does string concatenation
    eq    tests string equality
    ne    tests string inequality

ex:
    3 == "03"          is true (actually evaluates to 1)
    3 eq "03"          is false
    "blah" eq "onx"    is false
    "blah" ne "onx"    is true
    "blah".3           is the same as "blah"."3", which is "blah3"
    "$blah eeek!"      is the same as $blah." eeek!" (and depends on the value of $blah).

Then there are the logical operators: && (logical and), || (logical or), and ! (logical negation). && and || go between the two expressions, while ! goes before the expression it negates.

ex:
    3 == "03" && "blah" ne "onx"

is true, because both 3 and "03" have the same value, and the strings "blah" and "onx" are different. This expression actually works like this, because == and ne have tighter precedence than &&, but it could have been written more clearly as

    (3 == "03") && ("blah" ne "onx")

    !("eek" eq $blah)      is the same as      ("eek" ne $blah)

The =~ operator is a world on its own; it matches patterns and replaces them. The syntax for this is:

    value =~ /pattern/;                     # tests if the value matches
    value !~ /pattern/;                     # short for !(value =~ /pattern/),
                                            # to check if it *doesn't* match
    $variable =~ s/pattern/replacement/;    # searches for the pattern in the
                                            # variable and replaces it with the
                                            # replacement

If the variable (together with the =~) is omitted, the match or substitution acts on $_. In the first form, without the s/, =~ and !~ don't modify their arguments, but merely test if they match. In the second form, with s/, the replacement (if any) is substituted for the matched text in the $variable. Also, some modifiers can be added just before the ;, to specify the matching mode:

    i - case-insensitive match
    g - global match, i.e. matches more than one occurrence of the pattern rather than just one

The patterns are "regular expressions" quite close to those of "grep", more on this later.
ex:
    "blah" =~ /bl/;        is true because "blah" contains a "bl"; it actually evaluates to 1 here
    "blahblah" =~ /bl/;    also evaluates to 1 for the same reason
    "blahblah" =~ /bl/g;   evaluates to 1 as well: used as a test, a global match is still just true or false. The g only matters when the match is repeated (in a loop) or used in a list context, where every occurrence of "bl" is found in turn
    "blah" !~ /c/i;        evaluates to 1 (true) because there are no c's (lower or upper case) in "blah"
    $var = "blahblah";
    $var =~ s/bl/QQ/;      evaluates to 1 and leaves "QQahblah" in $var
    $var = "blahblah";
    $var =~ s/BL/qq/ig;    evaluates to 2 (2 substitutions made), and leaves "qqahqqah" in $var

Finally, the assignment operator, =, assigns the second element to the first. There are some shortcut-operators for doing simple operations on variables:

    $var += $value;    adds $value to $var
    $var -= $value;    subtracts $value from $var
    $var *= $value;    multiplies $var by $value
    $var /= $value;    divides $var by $value

These are often used to increment counters by variable amounts:

    $counter += $step;

    $var++    evaluates to $var and adds one to it
    ++$var    evaluates to $var+1 and leaves it in $var

These are often used to increment/decrement counters by one; it is possible (but confusing) to use them within expressions. $var++; remains the easiest and most common way to add 1 to $var.

All of this (except =~ and pattern matching) applies to C almost exactly the same way as it does to perl, except that C doesn't have operations on strings directly, so there is no "ne" or "eq" there.

One particularity of the || and && operators is that they are quite often used as shortcuts for "if condition then do this", like:

    unlink("dumbfile") || die "Couldn't remove file\n";

(The ()'s around "dumbfile" are necessary here: without them, the || would apply to the string "dumbfile" rather than to the result of the unlink, and since that string is always true the die would never happen.)

This will first try the unlink(), and if it's successful then the whole || expression will be true (a || is true as soon as one of its sides is, and they are always evaluated left to right, as well as &&'s), so it will not evaluate the "die", and the program will continue to run.
On the other hand, if the unlink() fails, then to tell if the whole || expression is true or false, perl will dutifully evaluate the die(), making the program exit with the given error message.

This sounds like a really ugly trick, but it's actually *very* commonly used, nearly always in constructions like

    do_something || do_something_else_that_means_failure;

and

    do_something && do_something_else_that_means_success;

7. Control Structures
---------------------

All these structures can be nested, of course.

* if -- test a condition, and act differently whether it's true or not

The most general syntax is:

    if (condition goes here) {
        things;
        to;
        do;
    } elsif (another condition) {
        more_things;
        to_do;
    } else {
        even_more;
        things;
    }

This is pretty self-explanatory: perl will test the first condition, if it's true it will do the first block, if not it will test the second condition, if it's true do the second block, etc. There don't have to be any elsif()'s, or even an else; the structure can be as simple as:

    if (condition) {
        stuff;
        to_do;
    }

or

    if (condition) {
        some_stuff;
    } else {
        other_stuff;
    }

The {}'s are always necessary in this construction (unlike in C, where we don't need to write the {}'s whenever they enclose exactly one statement).

There is an alternate shorter version for the simplest case (i.e. when we don't have elsif's or else's, and when there is just one statement in the conditional block):

    if (condition) {
        do_stuff;
    }

can be shortened to

    do_stuff if condition;

perl is the only language I've seen that has constructions with the condition *after* the conditional code... it's nice and short though :)

ex: (taken from dsirc, slightly simplified)

    if ($cmd eq 'ECHO') {
        &print($args);           # this print is one that dsirc defines,
                                 # which does some formatting, hence the &
    } elsif ($cmd eq 'CLEAR' || $cmd eq 'CL') {
        print $cls if $ansi;
        print "`#ssfe#l\n" if $ssfe;
    }

* unless -- does *exactly* the same as "if", except that it reverses the result of the test.
ex:
    print "It's not one!\n" unless $number == 1;

* while -- repeats a loop while a condition is true
* until -- repeats a loop while a condition is false

The syntax is (actually it can be quite a bit more complicated than that, giving names to blocks):

    while (condition) {          # replace 'while' with 'until' to reverse
                                 # the condition
        things;
        to;
        do;
    }

This will first test the condition, and if it's true, do all the things in the {} block, then test the condition again, and so on until the condition is false.

Within a block, there are two additional commands that can be quite useful:

* next -- skips the rest of the block and goes directly to test the condition again
* last -- skips the rest of the block and gets out of the block

ex:
    $i = 0;
    while ($i < 18) {
        print "$i ";
        $i++;
    }
    print "\n";
    # prints all the numbers from 0 to 17

    $i = 0;
    while ($i < 18) {
        next if $i == 4;
        print "$i ";
        $i++;
    }
    print "\n";
    # prints 0, 1, 2, 3 and then gets caught in a never-ending loop,
    # since $i never gets incremented past 4, and the test is still
    # true.

    $i = 0;
    while ($i < 18) {
        last if $i == 4;
        print "$i ";
        $i++;
    }
    print "\n";
    # prints only the numbers from 0 to 3

There is also the shortened version, for the case when there is only one statement to repeat:

    do_something while condition;
    do_something until condition;

ex:
    &wait_for_something while &no_input;

* foreach -- run a variable through the values of a list/array

The general syntax is:

    foreach $variable (array or arrays or scalars) {
        do_something;
        do_something_else;
    }

The $variable is optional, and if not given, $_ will be used. The arguments between the ()'s are an array or more arrays, or a list of variables, or function calls returning variables or arrays, or any mixture of these. All these scalars and arrays are "flattened" out into one array (just like push() does with everything after its first argument), and then the block between {}'s is executed once for each value, with $variable taking each of the values in turn.
Careful with this: if the code in the {} loop modifies the value of the variable, then the value in the array will be modified too! This will modify *your* array if and only if that array is exactly the only thing between the ()'s; if there are other arrays/variables there, then perl builds a temporary array with all the values to iterate on, and it's that temporary array that gets modified (without any consequence).

[ this is the perl 4 behavior; perl 5 makes the variable stand for the actual elements of every array in the list, so there the second loop below *does* modify @blah ]

As with while/until, the special keywords "last" and "next" are available inside a foreach() loop; last gets out of the loop, while next skips the rest of the loop for that value and goes on to the next.

ex:
    @blah = ("ick", 3, "onx", "bleargh");
    %eek = (10, "no", 5, "yes", 7.5, "maybe");
    foreach (@blah, keys(%eek), "heh") {
        print "|", $_, "| ";
    }
    print "\n";
    # prints "|ick| |3| |onx| |bleargh| |10| |5| |7.5| |heh|"
    # (the three keys may come out in a different order)

    foreach (@blah, keys(%eek), "heh") {
        $_ = "bleh" if $_ == 3;
    }
    print join(":", @blah)."\n";
    # modifies one of the values in memory, but @blah is unharmed
    # because it was a temporary copy for foreach(), so it still
    # prints "ick:3:onx:bleargh"

    foreach $i (@blah) {         # we specify a variable name for a change
        $i = "bleh" if $i == 3;
    }
    print join(":", @blah)."\n";
    # this time the statement in there modified the '3' into a 'bleh'
    # so at the end we get "ick:bleh:onx:bleargh"

Another rather confusing thing is that 'for' can be used as an exact synonym of 'foreach', so you pick whichever you prefer. I think 'foreach' makes the code more clear.

And then, there is another completely different structure that uses the same 'for' or 'foreach' keyword:

    for (statement1; condition; statement2) {
        things;
        to_do;
    }

This is the exact equivalent of:

    statement1;
    while (condition) {
        things;
        to_do;
        statement2;
    }

except that if you use 'next', instead of going to the test for 'condition', it goes to statement2, only skipping the rest of the "things to do" part.
This structure is kind of redundant, and while() is usually easier to understand, but there are cases where it fits really well, i.e. when 'statement1' is some kind of initialization, and 'statement2' is some kind of "go to the next one" thing.

ex:
    for ($i = 0; $i < 18; $i++) {
        print "$i ";
    }
    print "\n";
    # prints all the numbers from 0 to 17

    for ($i = 0; $i < 18; $i++) {
        next if $i == 4;
        print "$i ";
    }
    print "\n";
    # prints all the numbers from 0 to 17 except 4; this would have
    # been a little more annoying to do with a while(), because the
    # increment part gets skipped if we just do a "next if $i==4;"
    # so we'd have to introduce an if/else inside the while().

8. Regular expressions
----------------------

Regular expressions are a whole little language inside perl, and are shared with grep, vi's "/" command, and many other Unix programs and libraries. No two programs that use regular expressions understand them exactly the same way, though :) perl's regular expressions are very consistent and therefore quite easy to learn.

A regular expression is usually enclosed between /'s, though this is more a "common convention" than a rule in perl. It is not uncommon to see regular expressions delimited by |'s or #'s or some other character, to avoid putting \'s in front of all /'s in the regexp.

A regular expression defines a pattern that will be matched against the text contained in scalar variables.

In a regular expression, all alphanumeric characters (a to z, A to Z, 0 to 9, and _) match only themselves, while all other characters can have special meanings. A '\' followed by a non-alphabetical character matches exactly that character. In particular, '\\' matches '\'.
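The points about delimiters and about characters matching themselves can be tried out with a small sketch like the following. The strings and variable names are made up for illustration. One detail the text above doesn't spell out: when a delimiter other than / is used, the pattern has to be introduced with the letter m.

```perl
$path = "/usr/local/bin/perl";

# with the usual / delimiters, every / in the pattern needs a \ in front
if ($path =~ /\/local\/bin\//) { $slashes = "match"; } else { $slashes = "no match"; }

# with m and |'s as delimiters, the /'s can be written as they are
if ($path =~ m|/local/bin/|)   { $pipes = "match"; }   else { $pipes = "no match"; }

# letters match only themselves, so upper and lower case are different
if ($path =~ /PERL/)           { $upper = "match"; }   else { $upper = "no match"; }

# a \ in front of a non-alphabetical character makes it match itself
$price = 'cost: $5';
if ($price =~ /\$5/)           { $dollar = "match"; }  else { $dollar = "no match"; }

print "$slashes, $pipes, $upper, $dollar\n";
# prints "match, match, no match, match"
```

The first two tests are the same pattern written two ways; which one is more readable is mostly a question of how many /'s the pattern contains.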

    ^  -- (only at the beginning of the regexp): makes the text match the
          pattern only if the pattern occurs at the beginning of the text
    $  -- (only at the end of the regexp): makes the text match the
          pattern only if the pattern occurs at the end of the text, i.e.
          the regular expression is matched till the end of the text

ex:
    "blah" =~ /la/       is true
    "blah" =~ /^la/      is false
    "blah" =~ /^bl/      is true
    "blah" =~ /h$/       is true
    "blah" =~ /la$/      is false
    "blah" =~ /^blah$/   is true

    .  -- matches any single character
    \t -- matches a tab
    \s -- matches a single whitespace character (space, tab, newline)
    \S -- matches any single character other than whitespace
    \n -- matches a newline
    \w -- matches any single letter, digit, or _
    \W -- matches any single character other than a letter, digit or _
    \d -- matches any single digit (0 to 9)
    \D -- matches any single character but a digit

ex:
    "blah" =~ /b.a/      is true
    "blah" =~ /b.la/     is false
    "blah" =~ /b\w.h$/   is true
    "blah" =~ /\w\D/     is true
    "blah" =~ /\d/       is false (because there are no digits
                         *not* because it's not all digits)

    [characters] -- matches one character as long as it's one of those
          between the []'s
          also, ranges can be specified with -, like [m-z] matches any
          letter from 'm' to 'z'.
          if the first character in the [] is a ^, then the meaning is
          reversed, and the []-expression matches one character as long
          as it's *not* one of those between the []'s

ex:
    [-.0-9] matches exactly a '-', a '.', or a digit.
    [^\@ \t] matches exactly one character as long as it's not a @, a
      tab, or a space (the \ before the @ is completely optional here --
      a @ matches itself just because it doesn't have any special
      meaning, but a \@ also matches a @ because a \ before a
      non-letter makes it match only itself)

    ( ) -- groups a bit of regular expression into a single matchable unit
    *   -- modifies the previous matchable unit to be accepted any number
           of times, including 0
    +   -- modifies the previous matchable unit to be accepted any number
           of times, but at least once
    ?   -- makes the previous matchable unit optional

ex:
    "blah" =~ /c*k*z?b+.l/                  is false (the 'b+' needs the
      only 'b', the '.' then takes the 'l', and nothing is left for the
      'l' in the pattern)
    "ccccckkzbbbbb8lZOINX" =~ /c*k*z?b+.l/  is true
    "blahblah" =~ /ah(EEK)?bl/              is true
    "blahEEKblah" =~ /ah(EEK)?bl/           is true
    "blahEEKEEKblah" =~ /ah(EEK)?bl/        is false
    "blahEEKEEKblah" =~ /ah(EEK)+bl/        is true

    "blah" =~ /b*.l/ is true for non-obvious reasons: if the 'b*' matches
      the 'b' in 'blah', then the . has to match the l, and the following
      l in the pattern doesn't work with anything. but 'b*' also matches
      the empty string, and then the '.' can match the 'b', and the 'l'
      the other 'l', so it does work. the point being, all we test is
      that it matches in at least *one* way, but looking at it from left
      to right matching as much as we can every time is not always the
      way to see it.

    /^([^\@ \t]+\@[^\@ \t]+)\s+([-.0-9]+)$/ matches any line starting
      with a non-0 number of characters that are neither @'s nor spaces
      nor tabs, then a @, then more characters that are neither @'s nor
      spaces nor tabs, then some spaces or tabs, and then any mixture of
      -'s, .'s and digits. in other words, an email address followed by
      spaces/tabs and then something that looks reasonably like a number.
      (the extra ()'s around parts of the pattern are there so that those
      parts can be isolated later in the variables $1, $2...).

    /^\s*(\d+)\.(\d+)\.(\d+)\.(\d+)\s*$/ matches optional whitespace,
      followed by something that looks an awful lot like an IP address,
      followed by more optional whitespace.
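
The way ?, + and * treat a ()-group can be checked with a short sketch like
this one. The variable names are made up for the illustration; the strings
and the /ah(EEK)?bl/ family of patterns come from the examples above.

```perl
# Each match is tested in a ?: so that we get a plain true/false answer.
$absent = ("blahblah"       =~ /ah(EEK)?bl/) ? "true" : "false";
$once   = ("blahEEKblah"    =~ /ah(EEK)?bl/) ? "true" : "false";
$twice  = ("blahEEKEEKblah" =~ /ah(EEK)?bl/) ? "true" : "false";
$plus   = ("blahEEKEEKblah" =~ /ah(EEK)+bl/) ? "true" : "false";
$star   = ("blahEEKEEKblah" =~ /ah(EEK)*bl/) ? "true" : "false";
$needed = ("blahblah"       =~ /ah(EEK)+bl/) ? "true" : "false";

print "$absent $once $twice $plus $star $needed\n";
# prints "true true false true true false"
```

The ? accepts the group zero times or once, so two EEKs in a row are too
many for it, while + insists on at least one and therefore rejects
"blahblah".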

When a pattern is successfully matched, some "reserved variables" are set
to values that tell about different bits of the match:

    $& -- is set to the part of the text that matched the pattern
    $` -- is set to all the text before the matched part
    $' -- is set to all the text after the matched part

ex:
    "blahblahblahfrenchfries" =~ /(hbla)+/;
    #___^^^^^^^^____________
    #$`    $&        $'
    #
    # at this point $& is "hblahbla"
    #               $` is "bla"
    #               $' is "hfrenchfries"

Also, if the pattern had ()'s sub-patterns, then those are assigned to
numbered variables, $1 for the first, $2 for the second, and so on:

    " 129.199.129.13" =~ /^\s*(\d+)\.(\d+)\.(\d+)\.(\d+)\s*$/;
    # at this point $1 is 129, $2 is 199, $3 is 129, $4 is 13

A pattern match with substitution is written like this:

    $variable =~ s/pattern/replacement/;

The pattern is a regular expression just like before, and the replacement
is an ordinary text string, except that '$'s are interpolated in it,
making it possible to introduce variables like $1, $2 back into the same
string. In most cases it makes sense to add the flag 'g' after the
replacement, to make it replace all occurrences of the pattern instead of
just the first one. The flag 'i' can be useful too, and means that the
match should be done without taking case into account.

ex:
    $waitops{$c} =~ s/:${oldnick}:/:${newnick}:/gi;

Regular expressions also work together with the perl function 'split',
which takes as arguments a regular expression, a scalar expression, and
optionally a second scalar which specifies the maximum number of fields to
split into.

    @array = split(/pattern/, expression);

splits the expression into an array of strings, and returns it. The
pattern is a regular expression specifying which substrings of the
expression are to be considered separators.
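
As a rough sketch of the substitution flags and of split() with a separator
pattern (the strings and variable names here are made up for illustration):

```perl
$line = "Blah blah BLAH french fries";

# Without flags, only the first occurrence is replaced, and case matters.
($first = $line) =~ s/blah/eek/;      # "Blah eek BLAH french fries"

# With 'g' every occurrence is replaced; with 'i' case is ignored.
($all = $line) =~ s/blah/eek/gi;      # "eek eek eek french fries"

# ()'s in the pattern can be reused in the replacement as $1, $2...
($swapped = "french fries") =~ s/(\w+) (\w+)/$2 $1/;    # "fries french"

# split() treats whatever matches the pattern as a separator.
@fields = split(/\s*:\s*/, "root : x:0 :0");

print "$first\n$all\n$swapped\n";
print join("|", @fields), "\n";       # prints "root|x|0|0"
```

The ($copy = $original) =~ s/.../.../ form copies the string first and then
substitutes in the copy, so $line itself is left alone.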

In the most typical case, we'll want to split a line into words, like
this:

    @words = split(/\s+/, $some_line);
    # if $some_line was "blah blah blah french fries", now
    # @words is ("blah", "blah", "blah", "french", "fries").

or even:

    ($val1, $val2) = split(/\s+/, $whatever);
    # sets $val1 to the first word of $whatever, and $val2 to the
    # second (and any 3rd or other words will just not be kept)

If a third argument 'n' is specified, split() will not split into more
than n parts, which means that the last part can actually contain more
than one "word":

    ($val1, $val2) = split(/\s+/, $whatever, 2);
    # sets $val1 to the first word of $whatever, and $val2 to
    # the 2nd word and all that comes after it


9. Local variables
------------------

At any moment, we can declare local variables, so a sub can work on some
variables without the danger of clobbering other variables that are being
used by other subs that have called this one. The way to do this is:

    local($variable, $anothervar, @even_an_array);

This creates these new variables the moment the statement is executed,
saving the previous values of the variables of the same name if they
existed; later these local variables are destroyed, and the saved values
are restored. They can also be initialized at the same time:

    local($variable, $anothervar) = (value, anothervalue);

Local variables are kept within their "scope", which lasts until the end
of the innermost {}-block they are defined in. In other words, until the
next '}' that closes the enclosing block (a sub, "if", "while", "for", or
similar block).
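
A minimal sketch of that save-and-restore behaviour; the subs 'inner' and
'outer' and the variable $name are made up for this illustration. Note that
a sub called while the local is alive sees the local value, not the saved
one.

```perl
$name = "global";

sub inner {
    # sees whatever value $name has at the moment it is called
    return "inner sees $name";
}

sub outer {
    local($name) = "local";    # saves "global", installs "local"
    return &inner;             # inner is called while the local is alive
}                              # leaving the block restores "global"

$during = &outer;
$after  = &inner;
print "$during\n";    # prints "inner sees local"
print "$after\n";     # prints "inner sees global"
```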

In particular, it is very common for subs to start by defining local
variables to give names to the arguments they have been passed:

ex:
    sub addhelp {
        # this sub is passed 2 arguments, a command name and some help text
        local ($cmd, $txt) = @_;

        $cmd =~ tr/A-Z/a-z/;
        # I haven't explained this "translation" operator, what it does
        # is translate characters one by one; in this case A becomes a,
        # B becomes b, and so on, so it turns $cmd into lower case

        foreach (split(/\n+/, $txt)) {
            # for each of the lines in $txt (the 2nd argument)
            next unless $_;     # skip empty lines
            push (@help, $_);   # add it to some global array
        }
        # at this point these variables $cmd and $txt disappear, so
        # if the sub that called addhelp had a variable called $cmd
        # too, it has kept its value undisturbed
    }


10. General comments; presentation
----------------------------------

perl code, as with C code or code in just about any modern computer
language, is a rather structured thing, with nested loops enclosed within
delimiters like {}. It is quite important to write code in a way that
shows this structure, so one can immediately see where each block ends,
and what matches what. There are many possible styles of writing; some
people code like this:

    sub blah
    {
        print "something";
        if ($something_else)
        {
            &do_something;
            return;
        }
        even_more_stuff;
    }

while I would write it like this:

    sub blah {
        print "something";
        if ($something_else) {
            &do_something;
            return;
        }
        even_more_stuff;
    }

The important thing is to be consistent, and always indent the same way;
the perl interpreter itself takes absolutely no notice of indentation,
blank lines, spaces after commas, and spaces between operators, so these
are there only for us to be able to read what we've written. I usually
don't put spaces between operators, so I write $n+3 rather than $n + 3,
but in this file I've done it with the spaces; maybe it's somewhat clearer.
Spaces after commas are quite a necessity; something like
&blah($eek,$onx,$blurf); is much less readable than
&blah($eek, $onx, $blurf);

And the % key, in vi, is a huge help to anyone trying to get perl code to
work ... you press it on a ( or ) or { or } or [ or ] and it finds the
matching one for you.


11. What's missing
------------------

* files, file descriptors and how they're yet another namespace, how you
  open them, write to them, read from them, use print and <> on them
* command-line arguments... just one word, they're in @ARGV
* detailed explanation of the difference between an undef and an empty
  string and where we get each
* the remaining builtin functions, eval, grep, splice
* more Unix-specific stuff like fork, kill, network connections etc.
* the "/bin/test"-like tests on files
* the more general forms of while() and for(), giving names to blocks.
* any perl5-specific stuff


12. References
--------------

man perl