home
***
CD-ROM
|
disk
|
FTP
|
other
***
search
/
Fresh Fish 2
/
FFMCD02.bin
/
new
/
dev
/
misc
/
cweb
/
examples
/
commonwords.w
next >
Wrap
Text File
|
1993-12-21
|
28KB
|
730 lines
% COMMONWORDS example of WEB/CWEB portability. This documentation was
% originally published in the ``Communications of the ACM'', Volume 29,6
% (June 1986), and later in the book ``Literate Programming'', October 1991,
% where it appeared as an example of WEB programming for Pascal by Donald
% E. Knuth. It has been translated into CWEB by Andreas Scherer to show
% the ease of portability between Pascal/WEB and C/CWEB. As little changes
% as possible were made, so that the module numbering of the Pascal and
% the C versions are virtually identical. Restrictions of Pascal mentioned
% in the WEB source were removed and the features of C were used instead.
% This program is distributed WITHOUT ANY WARRANTY, express or implied.
% WEB Version --- Don Knuth, 1984
% CWEB Version --- Andreas Scherer, 1993
% Copyright (c) 1993 Andreas Scherer
% Permission is granted to make and distribute verbatim copies of this
% document provided that the copyright notice and this permission notice
% are preserved on all copies.
% Permission is granted to copy and distribute modified versions of this
% document under the conditions for verbatim copying, provided that the
% entire resulting derived work is distributed under the terms of a
% permission notice identical to this one.
\font\csc=cmcsc10
\def\CWeb{{\tt CWEB}}
\def\WEB{{\tt WEB}}
\def\Pascal{{\csc Pascal}}
\def\title{COMMONWORDS (C Version 1.1)}
\def\topofcontents{\null\vfill
\centerline{\titlefont Counting common word frequencies}
\vskip15pt
\centerline{(C Version 1.1)}
\vfill}
\def\botofcontents{\vfill
\noindent Copyright \copyright\ 1993 Andreas Scherer
\bigskip
\noindent Permission is granted to make and distribute verbatim copies of
this document provided that the copyright notice and this permission
notice are preserved on all copies.
\smallskip
\noindent Permission is granted to copy and distribute modified versions
of this document under the conditions for verbatim copying, provided that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.}
@* Introduction. The purpose of this program is to solve the following
problem posed by Jon Bentley@^Bentley, Jon Louis@>:
{\narrower\noindent Given a text file and an integer $k$, print the $k$
most common words in the file (and the number of their occurrences) in
decreasing frequency.\par}
Jon intentionally left the problem somewhat vague, but he stated that ``a
user should be able to find the one hundred most frequent words in a
twenty-page technical paper (roughly a 50K byte file) without undue
emotional trauma.''
Let us agree that a {\it word\/} is a sequence of one or more contiguous
letters; {\tt "Bentley"} is a word, but {\tt "ain't"} isn't. The sequence
of letters should be maximal, in the sense that it cannot be lengthened
without including a nonletter. Uppercase letters are considered equivalent
to their lowercase counterparts, so that the words {\tt "Bentley"} and {\tt
"BENTLEY"} and {\tt "bentley"} are essentially identical.
The given problem still isn't well defined, for the file might contain more
than $k$ words, all of the same frequency; or there might not even be as
many as $k$ words. Let's be more precise: The most common words are to be
printed in order of decreasing frequency, with words of equal frequency
listed in alphabetic order. Printing should stop after $k$ words have been
output, if more than $k$ words are present.
@ The |stdin| file is assumed to contain the given text. If it
begins with a positive decimal number (preceded by optional blanks), that
number will be the value of $k$; otherwise we shall assume that $k=100$.
Answers will be sent to the |stdout| file.
@d default_k 100 /* use this value if $k$ isn't otherwise specified */
@ Besides solving the given problem, this program is supposed to be an
example of the {\tt CWEB} system, for people who know some \Cee\ but who
have never seen {\tt CWEB} before. Here is an outline of the program to be
constructed:
@^Differences between \Pascal\ and \Cee@>
@d FALSE 0
@d TRUE 1
@c
#include <stdio.h>
#include <stdlib.h>
typedef unsigned char boolean; /* for logical switches */
@<Type declarations@>@/
@<Global variables@>@/
@<Procedures for initialization@>@/
@<Procedures for input and output@>@/
@<Procedures for data manipulation@>@/
@<The main program@>@/
@ The main idea of the {\tt CWEB} approach is to let the program grow in
natural stages, with its parts presented in roughly the order that they
might have been written by a programmer who isn't especially clairvoyant.
For example, each global variable will be introduced when we first know
that it is necessary or desirable; the {\tt CWEB} system will take care of
collecting these declarations into the proper place. We already know about
one global variable, namely the number that Bentley called $k$. Let us
give it the more descriptive name |max_words_to_print|.
@<Global variables@>=
unsigned int max_words_to_print; /* at most this many words will be printed */
@ As we introduce new global variables, we'll often want to give them
certain starting values. This will be done by the |initialize| procedure,
whose body will consist of various pieces of code to be specified when we
think of particular kinds of initialization.
@<Procedures for initialization@>=
void initialize(void)
{
int i; /* all-purpose index for initializations */
@<Set initial values@>@;
}
@ The {\tt CWEB} system, which may be thought of as a preprocessor for
\Cee, includes a macro definition facility so that portable programs are
easier to write. For example, we have already defined `|default_k|' to be
100. Here are two more examples of {\tt CWEB} macros; they allow us to
write, e.g., `|incr(count[p])|' as a convenient abbreviation for the
statement `|count[p]=count[p]+1|'.
@d incr(A) /* increment a variable */
++A
@d decr(A) /* decrement a variable */
--A
@ Originally this program was written in the \Pascal\ {\tt WEB} language.
In difference to \Cee, \Pascal\ uses labels and |goto| statements when
abrupting procedures. This isn't necessary for \Cee\ programs; they
already know the |return| command, so this program will be totally
|goto|-free.
@^Differences between \Pascal\ and \Cee@>
@* Strategic considerations. What algorithms and data structures should be
used for Bentley's problem? Clearly we need to be able to recognize
different occurrences of the same word, so some sort of internal dictionary
is necessary. There's no obvious way to decide that a particular word of
the input cannot possibly be in the final set, until we've gotten very near
the end of the file; so we might as well remember every word that appears.
There should be a frequency count associated with each word, and we will
eventually want to run through the words in order of decreasing frequency.
But there's no need to keep these counts in order as we read through the
input, since the order matters only at the end.
Therefore it makes sense to structure our program as follows:
@<The main program@>=
void main(void)
{
initialize();
@<Establish the value of |max_words_to_print|@>@;
@<Input the text, maintaining a dictionary with frequency counts@>@;
@<Sort the dictionary by frequency@>@;
@<Output the results@>@;
}
@* Basic input routines. Let's switch to a bottom-up approach now, by
writing some of the procedures that we know will be necessary sooner or
later. Then we'll have some confidence that our program is taking shape,
even though we haven't decided yet how to handle the searching or the
sorting. It will be nice to get the messy details of \Cee\ input out of
the way and off our minds.
Here's a function that reads an optional positive integer, returning zero
if none is present at the beginning of the current line.
@<Procedures for input and output@>=
int read_int(void)
{
int n=0; /* the accumulated value */
char c; /* the character from the input we're reading */
if( (c=fgetc(stdin)) != EOF ) {
for(; (c=='\n') || (c==' ') || (c=='\t'); c=fgetc(stdin));
while(c>='0' && c<='9') {
n = 10*n +