% This file is part of the Stanford GraphBase (c) Stanford University 1992 \def\title{GB\_\thinspace SORT} @i boilerplate.w %<< legal stuff: PLEASE READ IT BEFORE MAKING ANY CHANGES! @* Introduction. This short GraphBase module provides a simple utility routine called |gb_linksort|, which is used in many of the other programs. @d NULL 0 /* |NULL| */ @p #include "gb_flip.h" /* we need to use the random number generator */ @@; @ @ Most of the graphs obtained from GraphBase data are parameterized, so that different effects can be obtained easily from the same underlying body of information. In many cases the desired graph is determined by selecting the ``heaviest'' vertices according to some notion of ``weight,'' and/or by taking a random sample of vertices. For example, the GraphBase routine |words(n,wt_vector,wt_threshold,seed)| creates a graph based on the |n| most common words of English, where common-ness is determined by a given weight vector. When several words have equal weight, we want to choose between them at random. In particular, this means that we can obtain a completely random choice of words if the weight vector assigns the same weight to each word. The |gb_linksort| routine is a convenient tool for this purpose. It takes a given linked list of nodes and shuffles their link fields so that the nodes can be read in decreasing order of weight, and so that equal-weight nodes appear in random order. {\sl Note: The random number generator of |gb_flip| must be initialized before |gb_linksort| is called.} The nodes sorted by |gb_linksort| can be records of any structure type, provided only that the first field is `|long| |key|' and the second field is `|struct| \\{this\_struct\_type} |*link|'. Further fields are not examined. The |node| type defined below is the simplest possible example of such a structure. Sorting is done by means of the |key| fields, which must each contain nonnegative integers less than $2^{31}$. After sorting is complete, the data will appear in 128 linked lists: |gb_sorted[127]|, |gb_sorted[126]|, \dots, |gb_sorted[0]|. Reading through these lists with a routine such as $$\vcenter{\halign{#\hfil\cr |{@+int j; @+node *p;|\cr \quad|for (j=127; j>=0; j--)|\cr \qquad|for (p=(node*)gb_sorted[j]; p; p=p->link)|\cr \qquad\qquad\\{look\_at}|(p)|;\cr }}$$ will look at the nodes in decreasing order of weight, as desired. In fact, all nodes whose keys are in the range $j\cdot2^{24}\le|key|<(j+1)\cdot2^{24}$ will appear in list |gb_sorted[j]|. Therefore the results will all be found in the single list |gb_sorted[0]|, if all the keys are strictly less than~$2^{24}$. @= typedef struct node_struct { long key; /* a numeric quantity, assumed nonnegative */ struct node_struct *link; /* the next node on a list */ } node; /* applications of |gb_linksort| may have other fields after |link| */ @ In the header file, |gb_sorted| is declared to be an array of pointers to |char|, since nodes may have different types in different applications. User programs should cast |gb_sorted| to the appropriate type as in the example above. @(gb_sort.h@>= extern void gb_linksort(); /* procedure to sort a linked list */ extern char* gb_sorted[]; /* the results of |gb_linksort| */ @ Six passes of a radix sort, using radix 256, will accomplish the desired objective rather quickly. (See, for example, Algorithm 5.2.5R in {\sl Sorting and Searching}.) The first two passes use random numbers instead of looking at the key fields, thereby effectively extending the keys so that nodes with equal keys will appear in reasonably random order. We move the nodes back and forth between two arrays of lists: the external array |gb_sorted| and a private array called |alt_sorted|. @= node *gb_sorted[256]; /* external bank of lists, for even-numbered passes */ static node *alt_sorted[256]; /* internal bank of lists, for odd-numbered passes */ @ So here we go with six passes over the data. @= void gb_linksort(l) node *l; {@+register int k; /* index to destination list */ register node **pp; /* current place in list of pointers */ register node *p, *q; /* pointers for list manipulation */ @; @; @; @; @; @; } @ @= for (pp=alt_sorted+255; pp>=alt_sorted; pp--) *pp=NULL; /* empty all the destination lists */ for (p=l; p; p=q) { k=gb_next_rand() >> 23; /* extract the eight most significant bits */ q=p->link; p->link=alt_sorted[k]; alt_sorted[k]=p; } @ @= for (pp=gb_sorted+255; pp>=gb_sorted; pp--) *pp=NULL; /* empty all the destination lists */ for (pp=alt_sorted+255; pp>=alt_sorted; pp--) for (p=*pp; p; p=q) { k=gb_next_rand() >> 23; /* extract the eight most significant bits */ q=p->link; p->link=gb_sorted[k]; gb_sorted[k]=p; } @ @= for (pp=alt_sorted+255; pp>=alt_sorted; pp--) *pp=NULL; /* empty all the destination lists */ for (pp=gb_sorted+255; pp>=gb_sorted; pp--) for (p=*pp; p; p=q) { k=p->key & 0xff; /* extract the eight least significant bits */ q=p->link; p->link=alt_sorted[k]; alt_sorted[k]=p; } @ Here we must read from |alt_sorted| from 0 to 255, not from 255 to 0, to get the desired final order. (Each pass reverses the order of the lists; it's tricky, but it works.) @= for (pp=gb_sorted+255; pp>=gb_sorted; pp--) *pp=NULL; /* empty all the destination lists */ for (pp=alt_sorted; ppkey >> 8) & 0xff; /* extract the next eight bits */ q=p->link; p->link=gb_sorted[k]; gb_sorted[k]=p; } @ @= for (pp=alt_sorted+255; pp>=alt_sorted; pp--) *pp=NULL; /* empty all the destination lists */ for (pp=gb_sorted+255; pp>=gb_sorted; pp--) for (p=*pp; p; p=q) { k=(p->key >> 16) & 0xff; /* extract the next eight bits */ q=p->link; p->link=alt_sorted[k]; alt_sorted[k]=p; } @ The most significant bits will lie between 0 and 127, because we assumed that the keys are nonnegative and less than $2^{31}$. (A similar routine would be able to sort signed integers, or unsigned long integers, but the \Cee\ code would not then be portable.) @= for (pp=gb_sorted+255; pp>=gb_sorted; pp--) *pp=NULL; /* empty all the destination lists */ for (pp=alt_sorted; ppkey >> 24) & 0xff; /* extract the most significant bits */ q=p->link; p->link=gb_sorted[k]; gb_sorted[k]=p; } @* Index. Here is a list that shows where the identifiers of this program are defined and used.