GNU Info File | 1996-09-28 | 50KB | 909 lines
This is Info file gawk.info, produced by Makeinfo-1.55 from the input
file /gnu-src/gawk-2.15.6/gawk.texi.
This file documents `awk', a program that you can use to select
particular records in a file and perform operations upon them.
This is Edition 0.15 of `The GAWK Manual',
for the 2.15 version of the GNU implementation
of AWK.
Copyright (C) 1989, 1991, 1992, 1993 Free Software Foundation, Inc.
Permission is granted to make and distribute verbatim copies of this
manual provided the copyright notice and this permission notice are
preserved on all copies.
Permission is granted to copy and distribute modified versions of
this manual under the conditions for verbatim copying, provided that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.
Permission is granted to copy and distribute translations of this
manual into another language, under the above conditions for modified
versions, except that this permission notice may be stated in a
translation approved by the Foundation.
File: gawk.info, Node: Output Separators, Next: OFMT, Prev: Print Examples, Up: Printing
Output Separators
=================
As mentioned previously, a `print' statement contains a list of
items, separated by commas. In the output, the items are normally
separated by single spaces. But they do not have to be spaces; a
single space is only the default. You can specify any string of
characters to use as the "output field separator" by setting the
built-in variable `OFS'. The initial value of this variable is the
string `" "', that is, just a single space.
The output from an entire `print' statement is called an "output
record". Each `print' statement outputs one output record and then
outputs a string called the "output record separator". The built-in
variable `ORS' specifies this string. The initial value of the
variable is the string `"\n"' containing a newline character; thus,
normally each `print' statement makes a separate line.
You can change how output fields and records are separated by
assigning new values to the variables `OFS' and/or `ORS'. The usual
place to do this is in the `BEGIN' rule (*note `BEGIN' and `END'
Special Patterns: BEGIN/END.), so that it happens before any input is
processed. You may also do this with assignments on the command line,
before the names of your input files.
The following example prints the first and second fields of each
input record separated by a semicolon, with a blank line added after
each line:
awk 'BEGIN { OFS = ";"; ORS = "\n\n" }
           { print $1, $2 }' BBS-list
If the value of `ORS' does not contain a newline, all your output
will be run together on a single line, unless you output newlines some
other way.
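As a quick check of this last point, here is a minimal sketch that sets
both `OFS' and `ORS' to strings containing no newline. The two-record
sample input and the shell variable `out' are invented for this
illustration; they are not part of the manual's data files.

```shell
# Sample input is made up for this sketch: two records of two fields each.
out=$(printf 'a b\nc d\n' |
      awk 'BEGIN { OFS = "-"; ORS = "|" }
                 { print $1, $2 }')
echo "$out"
```

Because `ORS' contains no newline, both output records run together on
one line: `a-b|c-d|'.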
File: gawk.info, Node: OFMT, Next: Printf, Prev: Output Separators, Up: Printing
Controlling Numeric Output with `print'
=======================================
When you use the `print' statement to print numeric values, `awk'
internally converts the number to a string of characters, and prints
that string. `awk' uses the `sprintf' function to do this conversion.
For now, it suffices to say that the `sprintf' function accepts a
"format specification" that tells it how to format numbers (or
strings), and that there are a number of different ways that numbers
can be formatted. The different format specifications are discussed
more fully in *Note Using `printf' Statements for Fancier Printing:
Printf.
The built-in variable `OFMT' contains the default format
specification that `print' uses with `sprintf' when it wants to convert
a number to a string for printing. By supplying different format
specifications as the value of `OFMT', you can change how `print' will
print your numbers. As a brief example:
awk 'BEGIN { OFMT = "%d"  # print numbers as integers
             print 17.23 }'
will print `17'.
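A second sketch along the same lines, this time keeping two digits
after the decimal point. The format `"%.2f"' and the shell variable
`out' are choices made for this illustration only.

```shell
out=$(awk 'BEGIN { OFMT = "%.2f"   # two digits after the decimal point
                   print 3.14159 }')
echo "$out"
```

Here `print' outputs `3.14'.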
File: gawk.info, Node: Printf, Next: Redirection, Prev: OFMT, Up: Printing
Using `printf' Statements for Fancier Printing
==============================================
If you want more precise control over the output format than `print'
gives you, use `printf'. With `printf' you can specify the width to
use for each item, and you can specify various stylistic choices for
numbers (such as what radix to use, whether to print an exponent,
whether to print a sign, and how many digits to print after the decimal
point). You do this by specifying a string, called the "format
string", which controls how and where to print the other arguments.
* Menu:
* Basic Printf:: Syntax of the `printf' statement.
* Control Letters:: Format-control letters.
* Format Modifiers:: Format-specification modifiers.
* Printf Examples:: Several examples.
File: gawk.info, Node: Basic Printf, Next: Control Letters, Prev: Printf, Up: Printf
Introduction to the `printf' Statement
--------------------------------------
The `printf' statement looks like this:
printf FORMAT, ITEM1, ITEM2, ...
The entire list of arguments may optionally be enclosed in parentheses.
The parentheses are necessary if any of the item expressions uses a
relational operator; otherwise it could be confused with a redirection
(*note Redirecting Output of `print' and `printf': Redirection.). The
relational operators are `==', `!=', `<', `>', `>=', `<=', `~' and `!~'
(*note Comparison Expressions: Comparison Ops.).
The difference between `printf' and `print' is the argument FORMAT.
This is an expression whose value is taken as a string; it specifies
how to output each of the other arguments. It is called the "format
string".
The format string is the same as in the ANSI C library function
`printf'. Most of FORMAT is text to be output verbatim. Scattered
among this text are "format specifiers", one per item. Each format
specifier says to output the next item at that place in the format.
The `printf' statement does not automatically append a newline to its
output. It outputs only what the format specifies. So if you want a
newline, you must include one in the format. The output separator
variables `OFS' and `ORS' have no effect on `printf' statements.
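The following sketch illustrates the points above, under the
assumption that your `awk' follows the behavior described in this
section. `OFS' and `ORS' are deliberately set to odd values to show
that `printf' ignores them, and the second statement encloses its
argument list in parentheses because one item uses `>'. The shell
variable `out' is invented for the illustration.

```shell
out=$(awk 'BEGIN { OFS = ";"; ORS = "!"
                   # OFS and ORS are ignored; the newline comes from the format.
                   printf "%s %s\n", "a", "b"
                   # Parentheses keep ">" from being read as a redirection.
                   printf("%d\n", 3 > 2) }')
echo "$out"
```

The first statement outputs `a b' and the second outputs `1', each on
its own line, since the comparison `3 > 2' is true.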
File: gawk.info, Node: Control Letters, Next: Format Modifiers, Prev: Basic Printf, Up: Printf
Format-Control Letters
----------------------
A format specifier starts with the character `%' and ends with a
"format-control letter"; it tells the `printf' statement how to output
one item. (If you actually want to output a `%', write `%%'.) The
format-control letter specifies what kind of value to print. The rest
of the format specifier is made up of optional "modifiers" which are
parameters such as the field width to use.
Here is a list of the format-control letters:
`c'
This prints a number as an ASCII character. Thus, `printf "%c",
65' outputs the letter `A'. The output for a string value is the
first character of the string.
`d'
This prints a decimal integer.
`i'
This also prints a decimal integer.
`e'
This prints a number in scientific (exponential) notation. For
example,
printf "%4.3e", 1950
prints `1.950e+03', with a total of four significant figures of
which three follow the decimal point. The `4.3' are "modifiers",
discussed below.
`f'
This prints a number in floating point notation.
`g'
This prints a number in either scientific notation or floating
point notation, whichever uses fewer characters.
`o'
This prints an unsigned octal integer.
`s'
This prints a string.
`x'
This prints an unsigned hexadecimal integer.
`X'
This prints an unsigned hexadecimal integer. However, for the
values 10 through 15, it uses the letters `A' through `F' instead
of `a' through `f'.
`%'
This isn't really a format-control letter, but it does have a
meaning when used after a `%': the sequence `%%' outputs one `%'.
It does not consume an argument.
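To see several of these letters side by side, here is a small sketch.
The argument values and the shell variables `out' and `sci' are chosen
only for this illustration.

```shell
# %c, %d, %o, %x and %X each consume one argument; %% consumes none.
out=$(awk 'BEGIN { printf "%c %d %o %x %X %%\n", 65, 255, 8, 255, 255 }')
# The scientific-notation example from the list above.
sci=$(awk 'BEGIN { printf "%4.3e\n", 1950 }')
echo "$out"
echo "$sci"
```

The first command outputs `A 255 10 ff FF %' and the second outputs
`1.950e+03'.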
File: gawk.info, Node: Format Modifiers, Next: Printf Examples, Prev: Control Letters, Up: Printf
Modifiers for `printf' Formats
------------------------------
A format specification can also include "modifiers" that can control
how much of the item's value is printed and how much space it gets. The
modifiers come between the `%' and the format-control letter. Here are
the possible modifiers, in the order in which they may appear:
`-'
The minus sign, used before the width modifier, says to
left-justify the argument within its specified width. Normally
the argument is printed right-justified in the specified width.
Thus,
printf "%-4s", "foo"
prints `foo '.
`WIDTH'
This is a number representing the desired width of a field.
Inserting any number between the `%' sign and the format control
character forces the field to be expanded to this width. The
default way to do this is to pad with spaces on the left. For
example,
printf "%4s", "foo"
prints ` foo'.
The value of WIDTH is a minimum width, not a maximum. If the item
value requires more than WIDTH characters, it can be as wide as
necessary. Thus,
printf "%4s", "foobar"
prints `foobar'.
Preceding the WIDTH with a minus sign causes the output to be
padded with spaces on the right, instead of on the left.
`.PREC'
This is a number that specifies the precision to use when printing.
This specifies the number of digits you want printed to the right
of the decimal point. For a string, it specifies the maximum
number of characters from the string that should be printed.
The C library `printf''s dynamic WIDTH and PREC capability (for
example, `"%*.*s"') is supported. Instead of supplying explicit WIDTH
and/or PREC values in the format string, you pass them in the argument
list. For example:
w = 5
p = 3
s = "abcdefg"
printf "<%*.*s>\n", w, p, s
is exactly equivalent to
s = "abcdefg"
printf "<%5.3s>\n", s
Both programs output `<**abc>'. (We have used the bullet symbol "*" to
represent a space, to clearly show you that there are two spaces in the
output.)
Earlier versions of `awk' did not support this capability. You may
simulate it by using concatenation to build up the format string, like
w = 5
p = 3
s = "abcdefg"
printf "<%" w "." p "s>\n", s
This is not particularly easy to read, however.
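The modifiers can be combined in a single format string. The sketch
below uses a left-justified width, a width with a precision for a
number, and a precision alone for a string; the values and the shell
variable `out' are made up for the illustration.

```shell
out=$(awk 'BEGIN { printf "<%-6s|%6.2f|%.3s>\n", "foo", 3.14159, "abcdefg" }')
echo "$out"
```

This outputs `<foo   |  3.14|abc>': `foo' is padded on the right to six
characters, the number is rounded to two decimal places and padded on
the left to six characters, and the string is cut to its first three
characters.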
File: gawk.info, Node: Printf Examples, Prev: Format Modifiers, Up: Printf
Examples of Using `printf'
--------------------------
Here is how to use `printf' to make an aligned table:
awk '{ printf "%-10s %s\n", $1, $2 }' BBS-list
prints the names of bulletin boards (`$1') of the file `BBS-list' as a
string of 10 characters, left justified. It also prints the phone
numbers (`$2') afterward on the line. This produces an aligned
two-column table of names and phone numbers:
aardvark   555-5553
alpo-net   555-3412
barfly     555-7685
bites      555-1675
camelot    555-0542
core       555-2912
fooey      555-1234
foot       555-6699
macfoo     555-6480
sdace      555-3430
sabafoo    555-2127
Did you notice that we did not specify that the phone numbers be
printed as numbers? They had to be printed as strings because each one
contains a dash. Had we tried to print them as numbers, `awk' would
have converted only the leading digits (`555-5553' becomes 555), and
the rest of each phone number would have been lost.
We did not specify a width for the phone numbers because they are the
last things on their lines. We don't need to put spaces after them.
We could make our table look even nicer by adding headings to the
tops of the columns. To do this, use the `BEGIN' pattern (*note
`BEGIN' and `END' Special Patterns: BEGIN/END.) to force the header to
be printed only once, at the beginning of the `awk' program:
awk 'BEGIN { print "Name       Number"
             print "----       ------" }
           { printf "%-10s %s\n", $1, $2 }' BBS-list
Did you notice that we mixed `print' and `printf' statements in the
above example? We could have used just `printf' statements to get the
same results:
awk 'BEGIN { printf "%-10s %s\n", "Name", "Number"
             printf "%-10s %s\n", "----", "------" }
           { printf "%-10s %s\n", $1, $2 }' BBS-list
By outputting each column heading with the same format specification
used for the elements of the column, we have made sure that the headings
are aligned just like the columns.
The fact that the same format specification is used three times can
be emphasized by storing it in a variable, like this:
awk 'BEGIN { format = "%-10s %s\n"
             printf format, "Name", "Number"
             printf format, "----", "------" }
           { printf format, $1, $2 }' BBS-list
See if you can use the `printf' statement to line up the headings and
table data for our `inventory-shipped' example covered earlier in the
section on the `print' statement (*note The `print' Statement: Print.).
File: gawk.info, Node: Redirection, Next: Special Files, Prev: Printf, Up: Printing
Redirecting Output of `print' and `printf'
==========================================
So far we have been dealing only with output that prints to the
standard output, usually your terminal. Both `print' and `printf' can
also send their output to other places. This is called "redirection".
A redirection appears after the `print' or `printf' statement.
Redirections in `awk' are written just like redirections in shell
commands, except that they are written inside the `awk' program.
* Menu:
* File/Pipe Redirection:: Redirecting Output to Files and Pipes.
* Close Output:: How to close output files and pipes.
File: gawk.info, Node: File/Pipe Redirection, Next: Close Output, Prev: Redirection, Up: Redirection
Redirecting Output to Files and Pipes
-------------------------------------
Here are the three forms of output redirection. They are all shown
for the `print' statement, but they work identically for `printf' also.
`print ITEMS > OUTPUT-FILE'
This type of redirection prints the items onto the output file
OUTPUT-FILE. The file name OUTPUT-FILE can be any expression.
Its value is changed to a string and then used as a file name
(*note Expressions as Action Statements: Expressions.).
When this type of redirection is used, the OUTPUT-FILE is erased
before the first output is written to it. Subsequent writes do not
erase OUTPUT-FILE, but append to it. If OUTPUT-FILE does not
exist, then it is created.
For example, here is how one `awk' program can write a list of BBS
names to a file `name-list' and a list of phone numbers to a file
`phone-list'. Each output file contains one name or number per
line.
awk '{ print $2 > "phone-list"
       print $1 > "name-list" }' BBS-list
`print ITEMS >> OUTPUT-FILE'
This type of redirection prints the items onto the output file
OUTPUT-FILE. The difference between this and the single-`>'
redirection is that the old contents (if any) of OUTPUT-FILE are
not erased. Instead, the `awk' output is appended to the file.
`print ITEMS | COMMAND'
It is also possible to send output through a "pipe" instead of
into a file. This type of redirection opens a pipe to COMMAND
and writes the values of ITEMS through this pipe, to another
process created to execute COMMAND.
The redirection argument COMMAND is actually an `awk' expression.
Its value is converted to a string, whose contents give the shell
command to be run.
For example, this produces two files, one unsorted list of BBS
names and one list sorted in reverse alphabetical order:
awk '{ print $1 > "names.unsorted"
       print $1 | "sort -r > names.sorted" }' BBS-list
Here the unsorted list is written with an ordinary redirection
while the sorted list is written by piping through the `sort'
utility.
Here is an example that uses redirection to mail a message to a
mailing list `bug-system'. This might be useful when trouble is
encountered in an `awk' script run periodically for system
maintenance.
report = "mail bug-system"
print "Awk script failed:", $0 | report
print "at record number", FNR, "of", FILENAME | report
close(report)
We call the `close' function here because it's a good idea to close
the pipe as soon as all the intended output has been sent to it.
*Note Closing Output Files and Pipes: Close Output, for more
information on this. This example also illustrates the use of a
variable to represent a FILE or COMMAND: it is not necessary to
always use a string constant. Using a variable is generally a
good idea, since `awk' requires you to spell the string value
identically every time.
Redirecting output using `>', `>>', or `|' asks the system to open a
file or pipe only if the particular FILE or COMMAND you've specified
has not already been written to by your program, or if it has been
closed since it was last written to.
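The difference between `>' and `>>' shows up when the same file is
written by two separate `awk' runs. The sketch below is only an
illustration: the scratch directory, the file names `list' and `names',
and the variables `out' and `count' are all invented, and the standard
`-v' option is used to pass the output file name into the program.

```shell
dir=$(mktemp -d)    # scratch directory, invented for this sketch
printf 'aardvark 555-5553\nalpo-net 555-3412\n' > "$dir/list"

# First run: ">" empties the file before the first write, then appends.
awk -v out="$dir/names" '{ print $1 > out }' "$dir/list"
# Second run: ">>" keeps the two lines already there and adds two more.
awk -v out="$dir/names" '{ print $1 >> out }' "$dir/list"

count=$(wc -l < "$dir/names" | tr -d ' ')
echo "$count"
rm -rf "$dir"
```

The file ends up with four lines. Had the second run used `>' as well,
it would have erased the first run's output and left only two.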
File: gawk.info, Node: Close Output, Prev: File/Pipe Redirection, Up: Redirection
Closing Output Files and Pipes
------------------------------
When a file or pipe is opened, the file name or command associated
with it is remembered by `awk' and subsequent writes to the same file or
command are appended to the previous writes. The file or pipe stays
open until `awk' exits. This is usually convenient.
Sometimes there is a reason to close an output file or pipe earlier
than that. To do this, use the `close' function, as follows:
close(FILENAME)
close(COMMAND)
The argument FILENAME or COMMAND can be any expression. Its value
must exactly equal the string used to open the file or pipe to begin
with--for example, if you open a pipe with this:
print $1 | "sort -r > names.sorted"
then you must close it with this:
close("sort -r > names.sorted")
Here are some reasons why you might need to close an output file:
* To write a file and read it back later on in the same `awk'
program. Close the file when you are finished writing it; then
you can start reading it with `getline' (*note Explicit Input with
`getline': Getline.).
* To write numerous files, successively, in the same `awk' program.
If you don't close the files, eventually you may exceed a system
limit on the number of open files in one process. So close each
one when you are finished writing it.
* To make a command finish. When you redirect output through a pipe,
the command reading the pipe normally continues to try to read
input as long as the pipe is open. Often this means the command
cannot really do its work until the pipe is closed. For example,
if you redirect output to the `mail' program, the message is not
actually sent until the pipe is closed.
* To run the same program a second time, with the same arguments.
This is not the same thing as giving more input to the first run!
For example, suppose you pipe output to the `mail' program. If you
output several lines redirected to this pipe without closing it,
they make a single message of several lines. By contrast, if you
close the pipe after each line of output, then each line makes a
separate message.
`close' returns a value of zero if the close succeeded. Otherwise,
the value will be non-zero. In this case, `gawk' sets the variable
`ERRNO' to a string describing the error that occurred.
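Here is a small sketch of the "make a command finish" case. The
variable names `cmd' and `out' are invented for the illustration;
storing the command in a variable guarantees that `close' receives
exactly the string that opened the pipe.

```shell
out=$(awk 'BEGIN { cmd = "sort"
                   print "b" | cmd
                   print "a" | cmd
                   close(cmd)        # sort sees end of input and writes its result
                   print "done" }')
echo "$out"
```

The sorted lines `a' and `b' appear before `done', because `close'
waits for `sort' to finish before the last `print' runs.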
File: gawk.info, Node: Special Files, Prev: Redirection, Up: Printing
Standard I/O Streams
====================
Running programs conventionally have three input and output streams
already available to them for reading and writing. These are known as
the "standard input", "standard output", and "standard error output".
These streams are, by default, terminal input and output, but they are
often redirected with the shell, via the `<', `<<', `>', `>>', `>&' and
`|' operators. Standard error is used only for writing error messages;
the reason we have two separate streams, standard output and standard
error, is so that they can be redirected separately.
In other implementations of `awk', the only way to write an error
message to standard error in an `awk' program is as follows:
print "Serious error detected!\n" | "cat 1>&2"
This works by opening a pipeline to a shell command which can access the
standard error stream which it inherits from the `awk' process. This
is far from elegant, and is also inefficient, since it requires a
separate process. So people writing `awk' programs have often
neglected to do this. Instead, they have sent the error messages to the
terminal, like this:
NF != 4 {
    printf("line %d skipped: doesn't have 4 fields\n", FNR) > "/dev/tty"
}
This has the same effect most of the time, but not always: although the
standard error stream is usually the terminal, it can be redirected, and
when that happens, writing to the terminal is not correct. In fact, if
`awk' is run from a background job, it may not have a terminal at all.
Then opening `/dev/tty' will fail.
`gawk' provides special file names for accessing the three standard
streams. When you redirect input or output in `gawk', if the file name
matches one of these special names, then `gawk' directly uses the
stream it stands for.
`/dev/stdin'
The standard input (file descriptor 0).
`/dev/stdout'
The standard output (file descriptor 1).
`/dev/stderr'
The standard error output (file descriptor 2).
`/dev/fd/N'
The file associated with file descriptor N. Such a file must have
been opened by the program initiating the `awk' execution
(typically the shell). Unless you take special pains, only
descriptors 0, 1 and 2 are available.
The file names `/dev/stdin', `/dev/stdout', and `/dev/stderr' are
aliases for `/dev/fd/0', `/dev/fd/1', and `/dev/fd/2', respectively,
but they are more self-explanatory.
The proper way to write an error message in a `gawk' program is to
use `/dev/stderr', like this:
NF != 4 {
    printf("line %d skipped: doesn't have 4 fields\n", FNR) > "/dev/stderr"
}
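The point of using standard error is that the shell can then redirect
the two streams separately. The following sketch runs a shortened
version of the rule above twice, once discarding standard error and
once discarding standard output. The sample input and the shell
variables `prog', `out' and `err' are invented for the illustration.

```shell
prog='NF != 4 { printf("line %d skipped\n", FNR) > "/dev/stderr"; next }
               { print }'
# Keep standard output, throw away the error messages.
out=$(printf 'a b c d\nx y\n' | awk "$prog" 2>/dev/null)
# Keep the error messages, throw away standard output.
err=$(printf 'a b c d\nx y\n' | awk "$prog" 2>&1 >/dev/null)
echo "$out"
echo "$err"
```

The first command captures only `a b c d'; the second captures only
`line 2 skipped'.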
`gawk' also provides special file names that give access to
information about the running `gawk' process. Each of these "files"
provides a single record of information. To read them more than once,
you must first close them with the `close' function (*note Closing
Input Files and Pipes: Close Input.). The filenames are:
`/dev/pid'
Reading this file returns the process ID of the current process,
in decimal, terminated with a newline.
`/dev/ppid'
Reading this file returns the parent process ID of the current
process, in decimal, terminated with a newline.
`/dev/pgrpid'
Reading this file returns the process group ID of the current
process, in decimal, terminated with a newline.
`/dev/user'
Reading this file returns a single record terminated with a
newline. The fields are separated with blanks. The fields
represent the following information:
`$1'
The value of the `getuid' system call.
`$2'
The value of the `geteuid' system call.
`$3'
The value of the `getgid' system call.
`$4'
The value of the `getegid' system call.
If there are any additional fields, they are the group IDs
returned by `getgroups' system call. (Multiple groups may not be
supported on all systems.)
These special file names may be used on the command line as data
files, as well as for I/O redirections within an `awk' program. They
may not be used as source files with the `-f' option.
Recognition of these special file names is disabled if `gawk' is in
compatibility mode (*note Invoking `awk': Command Line.).
*Caution*: Unless your system actually has a `/dev/fd' directory
(or any of the other above listed special files), the
interpretation of these file names is done by `gawk' itself. For
example, using `/dev/fd/4' for output will actually write on file
descriptor 4, and not on a new file descriptor that was `dup''ed
from file descriptor 4. Most of the time this does not matter;
however, it is important to *not* close any of the files related
to file descriptors 0, 1, and 2. If you do close one of these
files, unpredictable behavior will result.
File: gawk.info, Node: One-liners, Next: Patterns, Prev: Printing, Up: Top
Useful "One-liners"
*******************
Useful `awk' programs are often short, just a line or two. Here is a
collection of useful, short programs to get you started. Some of these
programs contain constructs that haven't been covered yet. The
description of the program will give you a good idea of what is going
on, but please read the rest of the manual to become an `awk' expert!
Since you are reading this in Info, each line of the example code is
enclosed in quotes, to represent text that you would type literally.
The examples themselves represent shell commands that use single quotes
to keep the shell from interpreting the contents of the program. When
reading the examples, focus on the text between the open and close
quotes.
`awk '{ if (NF > max) max = NF }'
` END { print max }''
This program prints the maximum number of fields on any input line.
`awk 'length($0) > 80''
This program prints every line longer than 80 characters. The sole
rule has a relational expression as its pattern, and has no action
(so the default action, printing the record, is used).
`awk 'NF > 0''
This program prints every line that has at least one field. This
is an easy way to delete blank lines from a file (or rather, to
create a new file similar to the old file but from which the blank
lines have been deleted).
`awk '{ if (NF > 0) print }''
This program also prints every line that has at least one field.
Here we allow the rule to match every line, then decide in the
action whether to print.
`awk 'BEGIN { for (i = 1; i <= 7; i++)'
` print int(101 * rand()) }''
This program prints 7 random numbers from 0 to 100, inclusive.
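`ls -l FILES | awk '{ x += $4 } ; END { print "total bytes: " x }''
This program prints the total number of bytes used by FILES. It
assumes that the file size is the fourth field of the `ls -l' output;
where `ls -l' also lists the group, the size is the fifth field, `$5'.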
`expand FILE | awk '{ if (x < length()) x = length() }'
` END { print "maximum line length is " x }''
This program prints the maximum line length of FILE. The input is
piped through the `expand' program to change tabs into spaces, so
the widths compared are actually the right-margin columns.
`awk 'BEGIN { FS = ":" }'
` { print $1 | "sort" }' /etc/passwd'
This program prints a sorted list of the login names of all users.
`awk '{ nlines++ }'
` END { print nlines }''
This program counts lines in a file.
`awk 'END { print NR }''
This program also counts lines in a file, but lets `awk' do the
work.
`awk '{ print NR, $0 }''
This program adds line numbers to all its input files, similar to
`cat -n'.
File: gawk.info, Node: Patterns, Next: Actions, Prev: One-liners, Up: Top
Patterns
********
Patterns in `awk' control the execution of rules: a rule is executed
when its pattern matches the current input record. This chapter tells
all about how to write patterns.
* Menu:
* Kinds of Patterns:: A list of all kinds of patterns.
The following subsections describe
them in detail.
* Regexp:: Regular expressions such as `/foo/'.
* Comparison Patterns:: Comparison expressions such as `$1 > 10'.
* Boolean Patterns:: Combining comparison expressions.
* Expression Patterns:: Any expression can be used as a pattern.
* Ranges:: Pairs of patterns specify record ranges.
* BEGIN/END:: Specifying initialization and cleanup rules.
* Empty:: The empty pattern, which matches every record.
File: gawk.info, Node: Kinds of Patterns, Next: Regexp, Prev: Patterns, Up: Patterns
Kinds of Patterns
=================
Here is a summary of the types of patterns supported in `awk'.
`/REGULAR EXPRESSION/'
A regular expression as a pattern. It matches when the text of the
input record fits the regular expression. (*Note Regular
Expressions as Patterns: Regexp.)
`EXPRESSION'
A single expression. It matches when its value is nonzero (if a
number) or nonnull (if a string). (*Note
Expressions as Patterns: Expression Patterns.)
`PAT1, PAT2'
A pair of patterns separated by a comma, specifying a range of
records. (*Note Specifying Record Ranges with Patterns: Ranges.)
`BEGIN'
`END'
Special patterns to supply start-up or clean-up information to
`awk'. (*Note `BEGIN' and `END' Special Patterns: BEGIN/END.)
`NULL'
The empty pattern matches every input record. (*Note The Empty
Pattern: Empty.)
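The sketch below puts one rule of each kind into a single program. The
four-record sample input, the counter `n', and the shell variable `out'
are invented for this illustration.

```shell
out=$(printf '1 start\n2 foo\n3 stop\n4 bar\n' | awk '
BEGIN           { print "begin" }           # once, before any input
/foo/           { print "regexp:", $0 }     # regexp pattern
$1 > 3          { print "expr:", $0 }       # expression pattern
/start/, /stop/ { print "range:", $1 }      # range pattern
                { n++ }                     # empty pattern: every record
END             { print "end", n }          # once, after all input
')
echo "$out"
```

The range rule fires for records 1 through 3, the regexp rule for
record 2, the expression rule for record 4, and the rule with the empty
pattern for all four records, so the last line of output is `end 4'.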
File: gawk.info, Node: Regexp, Next: Comparison Patterns, Prev: Kinds of Patterns, Up: Patterns
Regular Expressions as Patterns
===============================
A "regular expression", or "regexp", is a way of describing a class
of strings. A regular expression enclosed in slashes (`/') is an `awk'
pattern that matches every input record whose text belongs to that
class.
The simplest regular expression is a sequence of letters, numbers, or
both. Such a regexp matches any string that contains that sequence.
Thus, the regexp `foo' matches any string containing `foo'. Therefore,
the pattern `/foo/' matches any input record containing `foo'. Other
kinds of regexps let you specify more complicated classes of strings.
* Menu:
* Regexp Usage:: How to Use Regular Expressions
* Regexp Operators:: Regular Expression Operators
* Case-sensitivity:: How to do case-insensitive matching.
File: gawk.info, Node: Regexp Usage, Next: Regexp Operators, Prev: Regexp, Up: Regexp
How to Use Regular Expressions
------------------------------
A regular expression can be used as a pattern by enclosing it in
slashes. Then the regular expression is matched against the entire
text of each record. (Normally, it only needs to match some part of
the text in order to succeed.) For example, this prints the second
field of each record that contains `foo' anywhere:
awk '/foo/ { print $2 }' BBS-list
Regular expressions can also be used in comparison expressions. Then
you can specify the string to match against; it need not be the entire
current input record. These comparison expressions can be used as
patterns or in `if', `while', `for', and `do' statements.
`EXP ~ /REGEXP/'
This is true if the expression EXP (taken as a character string)
is matched by REGEXP. The following example matches, or selects,
all input records with the upper-case letter `J' somewhere in the
first field:
awk '$1 ~ /J/' inventory-shipped
So does this:
awk '{ if ($1 ~ /J/) print }' inventory-shipped
`EXP !~ /REGEXP/'
This is true if the expression EXP (taken as a character string)
is *not* matched by REGEXP. The following example matches, or
selects, all input records whose first field *does not* contain
the upper-case letter `J':
awk '$1 !~ /J/' inventory-shipped
The right hand side of a `~' or `!~' operator need not be a constant
regexp (i.e., a string of characters between slashes). It may be any
expression. The expression is evaluated, and converted if necessary to
a string; the contents of the string are used as the regexp. A regexp
that is computed in this way is called a "dynamic regexp". For example:
identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*"
$0 ~ identifier_regexp
sets `identifier_regexp' to a regexp that describes `awk' variable
names, and tests if the input record matches this regexp.
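As a runnable sketch of a dynamic regexp, the program below stores a
regexp in a variable and uses it as a pattern. The variable names `re'
and `out' and the sample input are invented for this illustration, and
the regexp is anchored with `^' and `$' (an addition made here) so that
the whole record must be an identifier rather than merely contain one.

```shell
out=$(printf 'x1\n42\n_tmp\n' |
      awk 'BEGIN { re = "^[A-Za-z_][A-Za-z_0-9]*$" }
           $0 ~ re { print }')
echo "$out"
```

This prints `x1' and `_tmp', but not `42', which starts with a digit.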
File: gawk.info, Node: Regexp Operators, Next: Case-sensitivity, Prev: Regexp Usage, Up: Regexp
Regular Expression Operators
----------------------------
You can combine regular expressions with the following characters,
called "regular expression operators", or "metacharacters", to increase
the power and versatility of regular expressions.
Here is a table of metacharacters. All characters not listed in the
table stand for themselves.
`^'
This matches the beginning of the string or the beginning of a line
within the string. For example:
^@chapter
matches the `@chapter' at the beginning of a string, and can be
used to identify chapter beginnings in Texinfo source files.
`$'
This is similar to `^', but it matches only at the end of a string
or the end of a line within the string. For example:
p$
matches a record that ends with a `p'.
`.'
This matches any single character except a newline. For example:
.P
matches any single character followed by a `P' in a string. Using
concatenation we can make regular expressions like `U.A', which
matches any three-character sequence that begins with `U' and ends
with `A'.
`[...]'
This is called a "character set". It matches any one of the
characters that are enclosed in the square brackets. For example:
[MVX]
matches any one of the characters `M', `V', or `X' in a string.
Ranges of characters are indicated by using a hyphen between the
beginning and ending characters, and enclosing the whole thing in
brackets. For example:
[0-9]
matches any digit.
To include the character `\', `]', `-' or `^' in a character set,
put a `\' in front of it. For example:
[d\]]
matches either `d', or `]'.
This treatment of `\' is compatible with other `awk'
implementations, and is also mandated by the POSIX Command Language
and Utilities standard. The regular expressions in `awk' are a
superset of the POSIX specification for Extended Regular
Expressions (EREs). POSIX EREs are based on the regular
expressions accepted by the traditional `egrep' utility.
In `egrep' syntax, backslash is not syntactically special within
square brackets. This means that special tricks have to be used to
represent the characters `]', `-' and `^' as members of a
character set.
In `egrep' syntax, to match `-', write it as `---', which is a
range containing only `-'. You may also give `-' as the first or
last character in the set. To match `^', put it anywhere except
as the first character of a set. To match a `]', make it the
first character in the set. For example:
[]d^]
matches either `]', `d' or `^'.
`[^ ...]'
This is a "complemented character set". The first character after
the `[' *must* be a `^'. It matches any character *except* those
in the square brackets (or newline). For example:
[^0-9]
matches any character that is not a digit.
`|'
This is the "alternation operator" and it is used to specify
alternatives. For example:
^P|[0-9]
matches any string that matches either `^P' or `[0-9]'. This
means it matches any string that contains a digit or starts with
`P'.
The alternation applies to the largest possible regexps on either
side.
`(...)'
Parentheses are used for grouping in regular expressions as in
arithmetic. They can be used to concatenate regular expressions
containing the alternation operator, `|'.
`*'
This symbol means that the preceding regular expression is to be
repeated as many times as possible to find a match. For example:
ph*
applies the `*' symbol to the preceding `h' and looks for matches
to one `p' followed by any number of `h's. This will also match
just `p' if no `h's are present.
The `*' repeats the *smallest* possible preceding expression.
(Use parentheses if you wish to repeat a larger expression.) It
finds as many repetitions as possible. For example:
awk '/\(c[ad][ad]*r x\)/ { print }' sample
prints every record in the input containing a string of the form
`(car x)', `(cdr x)', `(cadr x)', and so on.
`+'
This symbol is similar to `*', but the preceding expression must be
matched at least once. This means that:
wh+y
would match `why' and `whhy' but not `wy', whereas `wh*y' would
match all three of these strings. This is a simpler way of
writing the last `*' example:
awk '/\(c[ad]+r x\)/ { print }' sample
`?'
This symbol is similar to `*', but the preceding expression can be
matched once or not at all. For example:
fe?d
will match `fed' and `fd', but nothing else.
`\'
This is used to suppress the special meaning of a character when
matching. For example:
\$
matches the character `$'.
The escape sequences used for string constants (*note Constant
Expressions: Constants.) are valid in regular expressions as well;
they are also introduced by a `\'.
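The three repetition operators described in the table can be compared
side by side. The sketch below tests each sample word against `wh*y',
`wh+y', and `wh?y' and reports which operators accept it. The `^' and
`$' anchors are added here (they are not part of the examples above) so
that the whole record has to match; the sample words are illustrative.

```shell
result=$(printf 'wy\nwhy\nwhhy\n' | awk '
  { line = $0 ":"
    if ($0 ~ /^wh*y$/) line = line " *"   # zero or more h
    if ($0 ~ /^wh+y$/) line = line " +"   # one or more h
    if ($0 ~ /^wh?y$/) line = line " ?"   # zero or one h
    print line }')
echo "$result"
```

With this input, `wy' should be accepted by `*' and `?', `why' by all
three operators, and `whhy' by `*' and `+' only.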
In regular expressions, the `*', `+', and `?' operators have the
highest precedence, followed by concatenation, and finally by `|'. As
in arithmetic, parentheses can change how operators are grouped.
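To see the effect of precedence, compare `^P|[0-9]' with the
parenthesized form `^(P|[0-9])'. In the first regexp the `^' belongs
only to the left alternative; in the second it applies to both. The
following is a sketch with made-up input records and hypothetical
variable names `a' and `b':

```shell
out=$(printf 'Pat\na1\nzP\n9x\n' | awk '
  /^P|[0-9]/   { a = a " " $0 }   # starts with P, or contains a digit
  /^(P|[0-9])/ { b = b " " $0 }   # starts with P or with a digit
  END { print "ungrouped:" a; print "grouped:" b }')
echo "$out"
```

The record `a1' matches only the ungrouped form, because its digit is
not at the beginning of the record.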
File: gawk.info, Node: Case-sensitivity, Prev: Regexp Operators, Up: Regexp
Case-sensitivity in Matching
----------------------------
Case is normally significant in regular expressions, both when
matching ordinary characters (i.e., not metacharacters), and inside
character sets. Thus a `w' in a regular expression matches only a
lower case `w' and not an upper case `W'.
The simplest way to do a case-independent match is to use a character
set: `[Ww]'. However, this can be cumbersome if you need to use it
often; and it can make the regular expressions harder for humans to
read. There are two other alternatives that you might prefer.
One way to do a case-insensitive match at a particular point in the
program is to convert the data to a single case, using the `tolower' or
`toupper' built-in string functions (which we haven't discussed yet;
*note Built-in Functions for String Manipulation: String Functions.).
For example:
tolower($1) ~ /foo/ { ... }
converts the first field to lower case before matching against it.
Another method is to set the variable `IGNORECASE' to a nonzero
value (*note Built-in Variables::.). When `IGNORECASE' is not zero,
*all* regexp operations ignore case. Changing the value of
`IGNORECASE' dynamically controls the case sensitivity of your program
as it runs. Case is significant by default because `IGNORECASE' (like
most variables) is initialized to zero.
x = "aB"
if (x ~ /ab/) ... # this test will fail
IGNORECASE = 1
if (x ~ /ab/) ... # now it will succeed
In general, you cannot use `IGNORECASE' to make certain rules
case-insensitive and other rules case-sensitive, because there is no way
to set `IGNORECASE' just for the pattern of a particular rule. To do
this, you must use character sets or `tolower'. However, one thing you
can do only with `IGNORECASE' is turn case-sensitivity on or off
dynamically for all the rules at once.
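Here is a sketch of a program that mixes case-insensitive and
case-sensitive rules without touching `IGNORECASE', using `tolower' and
a character set as suggested above. The input records and the counter
names `any', `cset', and `exact' are invented for the example.

```shell
out=$(printf 'Foo bar\nfoo BAR\nFOO Bar\n' | awk '
  tolower($1) ~ /foo/ { any++ }     # ignores case in the whole first field
  $1 ~ /[Ff]oo/       { cset++ }    # only the first letter may vary in case
  $2 ~ /bar/          { exact++ }   # ordinary case-sensitive match
  END { print any, cset, exact }')
echo "$out"
```

Each rule chooses its own case treatment, which is what a single
program-wide setting cannot provide.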
`IGNORECASE' can be set on the command line, or in a `BEGIN' rule.
Setting `IGNORECASE' from the command line is a way to make a program
case-insensitive without having to edit it.
The value of `IGNORECASE' has no effect if `gawk' is in
compatibility mode (*note Invoking `awk': Command Line.). Case is
always significant in compatibility mode.
File: gawk.info, Node: Comparison Patterns, Next: Boolean Patterns, Prev: Regexp, Up: Patterns
Comparison Expressions as Patterns
==================================
"Comparison patterns" test relationships such as equality between
two strings or numbers. They are a special case of expression patterns
(*note Expressions as Patterns: Expression Patterns.). They are written
with "relational operators", which are a superset of those in C. Here
is a table of them:
`X < Y'
True if X is less than Y.
`X <= Y'
True if X is less than or equal to Y.
`X > Y'
True if X is greater than Y.
`X >= Y'
True if X is greater than or equal to Y.
`X == Y'
True if X is equal to Y.
`X != Y'
True if X is not equal to Y.
`X ~ Y'
True if X matches the regular expression described by Y.
`X !~ Y'
True if X does not match the regular expression described by Y.
The operands of a relational operator are compared as numbers if they
are both numbers. Otherwise they are converted to, and compared as,
strings (*note Conversion of Strings and Numbers: Conversion., for the
detailed rules). Strings are compared by comparing the first character
of each, then the second character of each, and so on, until there is a
difference. If the two strings are equal until the shorter one runs
out, the shorter one is considered to be less than the longer one.
Thus, `"10"' is less than `"9"', and `"abc"' is less than `"abcd"'.
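The difference between numeric and string comparison can be checked
directly. This sketch compares the same digits once as numbers and once
as string constants; the variable names `num', `str', and `pre' are
illustrative only.

```shell
out=$(awk 'BEGIN {
  num = (10 < 9)            # both operands are numbers: numeric comparison
  str = ("10" < "9")        # both operands are strings: "1" sorts before "9"
  pre = ("abc" < "abcd")    # equal until the shorter string runs out
  print num, str, pre
}')
echo "$out"
```

A comparison yields 1 when it is true and 0 when it is false, so the
first value printed is 0 and the other two are 1.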
The left operand of the `~' and `!~' operators is a string. The
right operand is either a constant regular expression enclosed in
slashes (`/REGEXP/'), or any expression, whose string value is used as
a dynamic regular expression (*note How to Use Regular Expressions:
Regexp Usage.).
The following example prints the second field of each input record
whose first field is precisely `foo'.
awk '$1 == "foo" { print $2 }' BBS-list
Contrast this with the following regular expression match, which would
accept any record with a first field that contains `foo':
awk '$1 ~ "foo" { print $2 }' BBS-list
or, equivalently, this one:
awk '$1 ~ /foo/ { print $2 }' BBS-list
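The contrast is easy to observe on a small input. In this sketch the
three records stand in for `BBS-list' (they are invented for the
example), and the shell variable names are arbitrary. The last command
shows that a regexp anchored at both ends behaves like the equality
test for this data.

```shell
data='foo 1\nfoobar 2\nxfoo 3\n'
exact=$(printf "$data" | awk '$1 == "foo" { print $2 }')      # whole field
regex=$(printf "$data" | awk '$1 ~ /foo/ { print $2 }')       # anywhere in field
anchored=$(printf "$data" | awk '$1 ~ /^foo$/ { print $2 }')  # anchored regexp
echo "$exact"; echo "$regex"; echo "$anchored"
```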
File: gawk.info, Node: Boolean Patterns, Next: Expression Patterns, Prev: Comparison Patterns, Up: Patterns
Boolean Operators and Patterns
==============================
A "boolean pattern" is an expression which combines other patterns
using the "boolean operators" "or" (`||'), "and" (`&&'), and "not"
(`!'). Whether the boolean pattern matches an input record depends on
whether its subpatterns match.
For example, the following command prints all records in the input
file `BBS-list' that contain both `2400' and `foo'.
awk '/2400/ && /foo/' BBS-list
The following command prints all records in the input file
`BBS-list' that contain *either* `2400' or `foo', or both.
awk '/2400/ || /foo/' BBS-list
The following command prints all records in the input file
`BBS-list' that do *not* contain the string `foo'.
awk '! /foo/' BBS-list
Note that boolean patterns are a special case of expression patterns
(*note Expressions as Patterns: Expression Patterns.); they are
expressions that use the boolean operators. *Note Boolean Expressions:
Boolean Ops, for complete information on the boolean operators.
The subpatterns of a boolean pattern can be constant regular
expressions, comparisons, or any other `awk' expressions. Range
patterns are not expressions, so they cannot appear inside boolean
patterns. Likewise, the special patterns `BEGIN' and `END', which
never match any input record, are not expressions and cannot appear
inside boolean patterns.
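As a sketch of subpatterns of different kinds combined in one boolean
pattern, the following program mixes constant regexps with comparisons.
The input records and the `A:' and `B:' labels are made up for
illustration.

```shell
out=$(printf 'alpha 2400 foo\nbeta 1200 foo\ngamma 2400 bar\n' | awk '
  /2400/ && $3 != "foo" { print "A:" $1 }   # regexp and comparison
  ! /2400/ || NR == 1   { print "B:" $1 }   # negated regexp or comparison
')
echo "$out"
```

Both rules are tested against every record, so a record could trigger
either rule, both, or neither.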
File: gawk.info, Node: Expression Patterns, Next: Ranges, Prev: Boolean Patterns, Up: Patterns
Expressions as Patterns
=======================
Any `awk' expression is also valid as an `awk' pattern. Such a
pattern "matches" if the expression's value is nonzero (if a number) or
nonnull (if a string).
The expression is reevaluated each time the rule is tested against a
new input record. If the expression uses fields such as `$1', the
value depends directly on the new input record's text; otherwise, it
depends only on what has happened so far in the execution of the `awk'
program, but that may still be useful.
Comparison patterns are actually a special case of this. For
example, the expression `$5 == "foo"' has the value 1 when the value of
`$5' equals `"foo"', and 0 otherwise; therefore, this expression as a
pattern matches when the two values are equal.
Boolean patterns are also special cases of expression patterns.
A constant regexp as a pattern is also a special case of an
expression pattern. `/foo/' as an expression has the value 1 if `foo'
appears in the current input record; thus, as a pattern, `/foo/'
matches any record containing `foo'.
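Two expression patterns that are neither comparisons nor regexps are
sketched below. The arithmetic expression `NR % 2' is nonzero for
odd-numbered records, and the bare field reference `$2' is nonnull only
when the record has a second field. The input and the variable names
`odd' and `has2' are assumptions made for the example.

```shell
out=$(printf 'a\nb 5\nc\nd x\n' | awk '
  NR % 2 { odd = odd $1 }     # numeric value: matches records 1 and 3
  $2     { has2 = has2 $1 }   # string value: matches records with a 2nd field
  END    { print odd, has2 }')
echo "$out"
```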
Other implementations of `awk' that are not yet POSIX compliant are
less general than `gawk': they allow comparison expressions, and
boolean combinations thereof (optionally with parentheses), but not
necessarily other kinds of expressions.
File: gawk.info, Node: Ranges, Next: BEGIN/END, Prev: Expression Patterns, Up: Patterns
Specifying Record Ranges with Patterns
======================================
A "range pattern" is made of two patterns separated by a comma, of
the form `BEGPAT, ENDPAT'. It matches ranges of consecutive input
records. The first pattern BEGPAT controls where the range begins, and
the second one ENDPAT controls where it ends. For example,
awk '$1 == "on", $1 == "off"'
prints every record between `on'/`off' pairs, inclusive.
A range pattern starts out by matching BEGPAT against every input
record; when a record matches BEGPAT, the range pattern becomes "turned
on". The range pattern matches this record. As long as it stays
turned on, it automatically matches every input record read. It also
matches ENDPAT against every input record; when that succeeds, the
range pattern is turned off again for the following record. Now it
goes back to checking BEGPAT against each record.
The record that turns on the range pattern and the one that turns it
off both match the range pattern. If you don't want to operate on
these records, you can write `if' statements in the rule's action to
distinguish them.
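One way to write such an `if' statement is sketched here. The action
runs for every record in the range but prints only those that are
neither the `on' record nor the `off' record. The input records are
invented for the example.

```shell
out=$(printf 'skip\non\nkeep 1\nkeep 2\noff\nskip\n' | awk '
  $1 == "on", $1 == "off" {
    # Skip the two records that delimit the range.
    if ($1 != "on" && $1 != "off")
      print
  }')
echo "$out"
```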
It is possible for a range pattern to be turned both on and off by the
same record, if both conditions are satisfied by that record. Then the
action is executed for just that record.
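The following sketch shows both cases in one run, using the
hypothetical markers `start' and `end'. The second record contains both
markers, so it forms a range by itself; a later range spans three
records.

```shell
out=$(printf 'x\nstart end\ny\nstart\nz\nend\nw\n' | awk '/start/, /end/')
echo "$out"
```

The record `y' is not printed, because the range was already turned
off by the record that turned it on.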
File: gawk.info, Node: BEGIN/END, Next: Empty, Prev: Ranges, Up: Patterns
`BEGIN' and `END' Special Patterns
==================================
`BEGIN' and `END' are special patterns. They are not used to match
input records. Rather, they are used for supplying start-up or
clean-up information to your `awk' script. A `BEGIN' rule is executed,
once, before the first input record has been read. An `END' rule is
executed, once, after all the input has been read. For example:
awk 'BEGIN { print "Analysis of \"foo\"" }
/foo/ { ++foobar }
END { print "\"foo\" appears " foobar " times." }' BBS-list
This program finds the number of records in the input file `BBS-list'
that contain the string `foo'. The `BEGIN' rule prints a title for the
report. There is no need to use the `BEGIN' rule to initialize the
counter `foobar' to zero, as `awk' does this for us automatically
(*note Variables::.).
The second rule increments the variable `foobar' every time a record
containing the pattern `foo' is read. The `END' rule prints the value
of `foobar' at the end of the run.
The special patterns `BEGIN' and `END' cannot be used in ranges or
with boolean operators (indeed, they cannot be used with any operators).
An `awk' program may have multiple `BEGIN' and/or `END' rules. They
are executed in the order they appear, all the `BEGIN' rules at
start-up and all the `END' rules at termination.
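The ordering can be observed with a short sketch in which the `BEGIN'
and `END' rules are deliberately interleaved. The printed strings are
arbitrary, and the input is empty so that only these rules produce
output.

```shell
out=$(awk 'BEGIN { print "begin 1" }
           END   { print "end 1" }
           BEGIN { print "begin 2" }
           END   { print "end 2" }' < /dev/null)
echo "$out"
```

Both `BEGIN' rules run before either `END' rule, each group in the
order written.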
Multiple `BEGIN' and `END' sections are useful for writing library
functions, since each library can have its own `BEGIN' or `END' rule to
do its own initialization and/or cleanup. Note that the order in which
library files are named on the command line controls the order in
which their `BEGIN' and `END' rules are executed. Therefore you have
to be careful to write such rules in library files so that the order in
which they are executed doesn't matter. *Note Invoking `awk': Command
Line, for more information on using library functions.
If an `awk' program only has a `BEGIN' rule, and no other rules,
then the program exits after the `BEGIN' rule has been run. (Older
versions of `awk' used to keep reading and ignoring input until end of
file was seen.) However, if an `END' rule exists as well, then the
input will be read, even if there are no other rules in the program.
This is necessary in case the `END' rule checks the `NR' variable.
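A minimal sketch of the two cases follows. The first program has only a
`BEGIN' rule and finishes as soon as that rule has run; the second has
only an `END' rule, yet the input is read so that `NR' holds the record
count. The shell variable names and the sample input are illustrative.

```shell
# Only a BEGIN rule: the program exits after running it.
begin_only=$(awk 'BEGIN { print "start-up only" }' < /dev/null)

# Only an END rule: the three input records are still read and counted.
count=$(printf 'a\nb\nc\n' | awk 'END { print NR }')

echo "$begin_only"
echo "$count"
```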
`BEGIN' and `END' rules must have actions; there is no default
action for these rules since there is no current record when they run.
File: gawk.info, Node: Empty, Prev: BEGIN/END, Up: Patterns
The Empty Pattern
=================
An empty pattern is considered to match *every* input record. For
example, the program:
awk '{ print $1 }' BBS-list
prints the first field of every record.
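"Every" includes records that are empty. In this sketch, with made-up
input, the action runs three times even though the second record has
no fields at all:

```shell
out=$(printf 'a b\n\nc d\n' | awk '{ print NR ": " $1 }')
echo "$out"
```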