% This file is part of the Stanford GraphBase (c) Stanford University 1992
\def\title{GB\_WORDS}
@i boilerplate.w %<< legal stuff: PLEASE READ IT BEFORE MAKING ANY CHANGES!
\font\logosl=logosl10
\prerequisites{GB\_\thinspace GRAPH}{GB\_\thinspace IO}
@* Introduction. This GraphBase module provides two external subroutines:
$$\vcenter{\halign{#\hfil\cr
|words|, a routine that creates a graph based on five-letter words;\cr
|find_word|, a routine that looks for a given vertex in such a graph.\cr}}$$
Examples of the use of these routines can be found in the demo programs
called |word_components| and |ladders|.
@(gb_words.h@>=
extern Graph *words();
extern Vertex *find_word();
@ The subroutine call `|words(n,wt_vector,wt_threshold,seed)|'
constructs a graph based on the five-letter words in \.{words.dat}.
Each vertex of the graph corresponds to a single five-letter word. Two
words are adjacent in the graph if they are the same except in one
letter position. For example, `\.{words}' is adjacent to other words such as
`\.{cords}', `\.{wards}', `\.{woods}', `\.{worms}', and `\.{wordy}'.
The constructed graph has at most |n| vertices; indeed, it has exactly
|n| vertices if there are enough qualifying words. A word `qualifies'
if its weight is |wt_threshold| or more, where the `weight' is
computed from a table pointed to by~|wt_vector| according to rules
described below. (If parameter~|wt_vector|
is |NULL|, i.e., \.{NULL}, default weights are used.) The fourth parameter,
|seed|, is the seed of a random number generator.
All words of \.{words.dat} are sorted by weight. The first vertex of
the graph will be the word of largest
weight, the second vertex will have second-largest weight, and so on.
Words of equal weight will appear in pseudo-random order, as determined
by the value of |seed| in a system-independent fashion.
The first |n| words in order of decreasing weight are chosen to be
vertices of the graph. However, if fewer than |n| words have weight |>=
wt_threshold|, the graph will contain only the words that qualify. In
such cases the graph will have fewer than |n| vertices---possibly none at all.
Exception: The special case |n=0| is equivalent to the case when |n|
has been set to the highest possible value. It causes all qualifying
words to appear.
@ Every word in \.{words.dat} has been classified as `common' (\.*), `advanced'
(\.+), or `unusual' (\.\ ). Each word has also been assigned seven
frequency counts $c_1$, \dots,~$c_7$, separated by commas; these counts show
how often the word has occurred in different publication contexts:
$$\vcenter{\halign{$c_#$ times in &#\hfil\cr
1&the American Heritage Intermediate Corpus of elementary school material;\cr
2&the Brown Corpus of reading material from America;\cr
3&the Lancaster-Oslo/Bergen Corpus of reading material from Britain;\cr
4&the Melbourne-Surrey Corpus of newspaper material from Australia;\cr
5&the Revised Standard Version of the Bible;\cr
6&{\sl The \TeX book\/} and {\sl The {\logosl METAFONT\kern1pt}book\/}
by D. E. Knuth;\cr
7&{\sl Concrete Mathematics\/} by Graham, Knuth, and Patashnik.\cr}}$$
For example, one of the entries in \.{words.dat} is
$$\.{happy*774,92,121,2,26,8,1}$$
indicating a common word with $c_1=774$, \dots, $c_7=1$.
Parameter |wt_vector| points to an array of nine integers
$(a,b,w_1,\ldots,w_7)$.
The weight of each word is computed from these nine numbers by using the
formula
$$c_1w_1+\cdots+c_7w_7+
\cases{a,&if the word is `common';\cr
b,&if the word is `advanced';\cr
0,&if the word is `unusual'.\cr}$$
The components of |wt_vector| must be chosen so that
$$\max\bigl(\vert a\vert, \vert b\vert\bigr)
+ C_1\vert w_1\vert + \cdots +C_7\vert w_7\vert < 2^{30},$$
where $C_j$ is the maximum value of $c_j$ in the file; this restriction
ensures that the |words| procedure will produce the same results on all
computer systems.
@ The maximum frequency counts actually present are $C_1=15194$, $C_2=3560$,
$C_3=4467$, $C_4=460$, $C_5=6976$, $C_6=756$, and $C_7=362$; these can be
found in the entries for the common words `\.{shall}', `\.{there}',
`\.{which}', and `\.{would}'.
The default weights are $a=100$, $b=10$, $w_1=4$, $w_2=w_3=2$, $w_4=w_5=
w_6=w_7=1$.
File \.{words.dat} contains 5678 words, of which 3294 are `common', 1189 are
`advanced', and 1195 are `unusual'. Included among the unusual words are
823 having $c_1=\cdots=c_7=0$; such words
will always have weight zero, regardless of the weight vector parameter.
@<Private variables@>=
static int max_c[]={15194,3560,4467,460,6976,756,362};
/* maximum counts $C_j$ */
static int default_wt_vector[]={100,10,4,2,2,1,1,1,1};
/* use this if |wt_vector=NULL| */
@ Examples: If you call |words(2000,NULL,0,0)|, you get a graph with
2000 of the most common five-letter words of English, using the
default weights. The GraphBase programs are designed to be
system-independent, so that identical graphs will be obtained by
everybody who asks for |words(2000,NULL,0,0)|. Equivalent experiments
on algorithms for graph manipulation can therefore be performed by
researchers in different parts of the world.
The subroutine call |words(2000,NULL,0,s)| will produce slightly
different graphs when the random seed |s| varies, because some words
have equal weight. However, the graph for any particular value of~|s|
will be the same on all computers. The seed value can be any integer
in the range $0\le s<2^{31}$.
Suppose you call |words(6000,w,1,0)|, with |w| defined by the \Cee\ declaration
$$\hbox{|int w[9] = {1};|}$$
this means that $a=1$ and $b=w_1=\cdots=w_7=0$. Therefore you'll get a graph
containing only the 3294 `common' words. Similarly, it's possible to obtain
only the $3294+1189=4483$ non-`unusual' words, by specifying the weight vector
$$\hbox{|int w[9] = {1,1};|}$$
this makes $a=b=1$ and $w_1=\cdots=w_7=0$. In both of these examples, the
qualifying words all have weight~1, so the vertices of the graph will appear
in pseudo-random order.
If |w| points to an array of nine 0's, the call |words(n,w,0,s)| gives a
random sample of |n| words, depending on |s| in a system-independent fashion.
If the entries of the weight vector are all nonnegative, and if the
weight threshold is zero, every word of \.{words.dat} will qualify. Thus
you will obtain a graph with $\min(n,5678)$ vertices.
If |w| points to an array with {\it negative\/} weights, the call
|words(n,w,-0x7fffffff,0)| selects |n| of the {\it least\/} common
words in \.{words.dat}.
@ If the |words| routine encounters a problem, it returns |NULL|, after putting
a code number into the external variable |panic_code|. This code number
identifies the type of failure. Otherwise |words| returns a pointer to the
newly created graph, which will be represented with the data structures
explained in |gb_graph|. (The external variable |@!panic_code| is itself
defined in |gb_graph|.)
@d panic(c) @+{@+gb_free(node_blocks);
panic_code=c;@+gb_alloc_trouble=0;@+return NULL;@+}
@#
@f Graph int /* |gb_graph| defines the |Graph| type and a few others */
@f Vertex int
@f Area int
@ Now let's get going on the program. The \Cee\ file \.{gb\_words.c} begins
as follows:
@p
#include "gb_io.h" /* we will use the |gb_io| routines for input */
#include "gb_flip.h" /* we will use the |gb_flip| routines for random numbers */
#include "gb_graph.h" /* we will use the |gb_graph| data structures */
#include "gb_sort.h" /* and |gb_linksort| for sorting */
@#
@<Private variables@>@;
@@;
@@;
@#
Graph *words(n,wt_vector,wt_threshold,seed)
unsigned n; /* maximum number of vertices desired */
int wt_vector[]; /* pointer to array of weights */
long wt_threshold; /* minimum qualifying weight */
long seed; /* random number seed */
{@+@<Local variables@>@;
gb_init_rand(seed);
@<Check that |wt_vector| is valid@>;
@;
@;
if (gb_alloc_trouble) {
gb_recycle(new_graph);
panic(alloc_fault); /* oops, we ran out of memory somewhere back there */
}
return new_graph;
}
@ @<Local variables@>=
Graph *new_graph; /* the graph constructed by |words| */
@* Validating the weights. The first job that |words| needs to tackle is
comparatively trivial:
We want to verify the condition
$$\max\bigl(\vert a\vert, \vert b\vert\bigr)
+ C_1\vert w_1\vert + \cdots +C_7\vert w_7\vert < 2^{30}.\eqno(*)$$
But this proves to be an interesting exercise in ``portable
\Cee\ programming,'' because we don't want to risk integer overflow.
Our approach will be to do the
calculation first in floating point arithmetic, thereby ruling out cases
that are clearly unacceptable; once that test is passed, we will safely be
able to test the condition with ordinary integer arithmetic. Floating
point arithmetic is system dependent, but we will use it carefully so as to
obtain system-independent results.
@<Check that |wt_vector| is valid@>=
if (!wt_vector) wt_vector=default_wt_vector;
else {@+register double flacc;
register int *p,*q;
register long acc;
@