|
|
|
SAS Tabulate
 This document, based on a SAS session (1992-1993) of KUCC (Korea University
Computer Club) and a statistics seminar (1998) of the Department
of Public Administration, the Graduate School of Korea University,
is intended to provide new
and old SAS users with brief and essential information of the SAS
DATA STEP. It was primarily
tested with SAS release 6.12, but there is no difference in grammars,
except detailed system specifications and requirements, across the
releases.
This
document can be copied and distributed only for the sake of scholarly
work without the prior written
permission of the author.
No part of the document may be reproduced, modified, or changed.
This document may not be commercially used in any situations. Finally,
there can be some flaws in the document, but the author will not
be responsible for any problems caused by them.
REFERENCES
- SAS Institute Inc. (1990) SAS
Guide to the TABULATE Processing. 2nd ed. NC: SAS Institute Inc.
- SAS Institute Inc. (1990) SAS
Guide to the REPORT Procedure: Usage and Reference. 1st ed.
NC: SAS Institute Inc.
- Korea University Computer Center. (1980's). SAS
Workshop Manual
- Kim, Choong Ryun.
(1993) The Statistics
Package Called SAS: Focusing on the Statistics Analysis and
Marketing Research Methods. Seoul: Data Research.
- Cho, In Ho. (1993).
The SAS Lecture
and Statistics Cunsulting. Seoul: Hanwha Institute of Economics
What is
the SAS Tabulate?
- PROC TABULATE "summarizes values for all observations in
a data set and prints the summaries in table format. It can produce tables with one to three
dimensions, and within each dimension it allows multiple variables
to be reported one after another or hierarchically."
- "PROC TABULATE constructs
tables of descriptive statistics from class variables, analysis
variables, and keywords for statistics." It "treats
the values of variables used to create a table in one of the
two ways: as discrete classifications of data (class variables)
or as continuous values for which you can request descriptive
statistics (analysis variables)."
- If you want to directly generate
complicated and customized tables without a spreadsheet program,
the TABULATE will be best alternative. There might be some difficulties
being familiar with the grammars of TABULATE, but it will highly
worth investing some time and efforts to do that (I never agree
with Mr. Cho!). As a good tool for the customized tables, TABULATE
is equipped with high flexibility, various functionality, and
diverse formats.
- Although FREQ provides
similar features including chi-square test (see the below cross
table), it cannot support various statistics (e.g., mean, sum,
and standard deviation). And PROC
REPORT provides excellent
functionality of generating professional reports, but its focus
is on dealing efficiently with huge data set, not producing
table-based outputs.

Available
Statistics
N and NMISS
do not require any nonmissing observations.
SUM, MEAN,
MAX, MIN, RANGE, USS, and CSS require at ltast one nonmissing
observation.
VAR, STD,
STDERR, CV, T, and PRT require at least two observations.
Statistics |
Definition |
Notes |
CSS |
The sum of
squares corrected for the mean |
|
CV |
The percent
coefficient of variation |
MEAN<>0 |
MAX |
The maximum
value |
|
MEAN |
The arithmetic
mean |
|
MIN |
The minimum
value |
|
N |
the number
of observations with nonmissing variable values (frequency) |
|
NMISS |
The number
of observations with missing variable values |
|
PCTN |
The percentage
that one frequency represents of another frequency |
|
PCTSUM |
The percentage
that one sum represents of another sum |
|
PRT |
The two-tailed
p-value for Student's t with n-1 degree of freedom |
STD>0 |
RANGE |
The range=
MAX-MIN |
|
STD |
The standard
deviation |
|
STDERR |
The standard
error of the mean |
|
SUM |
The weighted
sum |
|
SUMWGT |
The sum of
weight |
|
USS |
The uncorrected
sum of squares |
|
T |
Student's t
for H0: population mean=0 |
STD>0 |
VAR |
The variance |
|
ALL |
|
Total
value |
Basic Principles
The TABULATE
statement is always accompanied by one or more TABLE statements
specifying the tables to be produced. You need to specify the
variables to be used in the table in either a VAR statement
(for analysis variables) or a CLASS statement (for class variables),
but not both.
Class variable means "any variable, numeric or
character, that you want to use to classify data into groups
or categories of information." Normally each class variable
has a small number of discrete values or unique levels. It can
have character, integer, or even decimal values, but the number
of unique values should be limited. CLASS statement
is to "identify variables in the input data set as class
variables.
- Analysis variable means "any numeric variable for
which you want to compute statistics." It is a quantitative
or continuous variable as well as a qualitative or discrete
variable for which you want descriptive statistics. VAR statement is to identify analysis variables in the input data
set.
- Missing values: "If an observation contains missing
values for any variable listed in the CLASS statement, the observation
is not included in the table unless you specify the MISSING
option in the PROC TABULATE statement." Missing values
for class variables cause observations to be omitted (skipped)
from the analysis performed to produce tables. "If an observation
contains missing values for a variable listed in the VAR statement,
the value is omitted from calculations of all statistics except
N and NMISS; Missing value for analysis variable affect only
the statistics for those variables. A different type of missing
value that is often confused with missing values in observations
is the missing value used to represent empty table cells."
- "By default, PROC TABULATE
evaluates each page it prints and omits columns and rows for
combinations of class variable values that do not exist. To
change the default, you can specify the PRINTMISS option in
the TABLE statement, and TABULATE will print the same headings
for each subtable."
- TABULATE provides the PCTN and PCTSUM statistics to allow you to print the percentage of
the value in a single cell to the value in another cell or to
the total of a group of cells. You need to construct the PCTN
or PCTSUM expression using a denominator definition that describes
to TABULATE what categories of information should be summed
to arrive at the denominator. Without the denominator, TABULATE
automatically summarizes the values in all SUM cells (for the
PCTSUM denominator) or all N cells (for the PCTN denominator):
defaults of each.
- The format for specifying a
denominator definition is to enclose it in brackets <>
that is appended to the PCTN or PCTSUM statistcs (e.g., PCTN<gender*marital>).To
obtain the percentage of the value in one table cell to the
total for the column containing the cell, use all class variables
that define the row as the denominator definition (TABLE gender*marital, party*PCTN<gender*marital>;). For the percentage of the value in
one table cell to the total for the row containing the cell,
use all class variable that define the column as the denominator
definition (TABLE
gender, marital*party*PCTN<marital*party>;).
- When you concatenate ALL
with other elements in the column dimension, TABULATE prints
a separate column that summarizes the observations reported
in each row of the table.
- Removing levels of headings:
there are times when the multiple levels of heading for class
and analysis variables are not necessary in your tables. this
often occurs when you replace a default heading with more descriptive
text that actually includes the information in two level of
heading, table share='number of shares'*sum=' '; no cell for
sum title

Typical
Form
of the Tabulate
PROC
TABULATE DATA=korea F=12.; CLASS
gender marital; VAR
income; TABLE
gender,income*(N MEAN SUM STD MIN MAX)*(F=7.1)
LABEL income='Annual Income'
marital='Marital Status'; RUN; |
- PROC TABULATE DATA=korea;
It represents the beginning
of the TABULATE function.
- F=12.; It is a format option, which represents the cell format
of 12 digits without decimal point (default is 12.2).
- CLASS gender; It specifies classification variables
that are located in the left (rows). All classification variables
shown in the table must be specified beforehand.
- VAR income; It specifies variables to be analyzed
and to be located in the column.
- Table gender,income*(N MEAN
SUM STD MIN MAX); it
provides the form of a table to be generated.
- Comma divides the location of variables; the former gender
will be shown in the left (row) in the table and the latter
income will be located in the column position.
- income*(N MEAN SUM STD MIN
MAX); It specifies that
mean, sum, and total will be included in the income variable.
- *(F=7.1); It specifies the format of each cell,
total 7 digit with 1 decimal digit.
- LABEL income='Annual Income';
It define the label to
be shown in the output. Instead of 'income', 'Annual Income'
will be shown in the table.
- RUN; It tells the computer to begin the procedure.

Options
PROC
TABULATE
<option list>; CLASS...; VAR ...; TABLE ... <option list>; FORMAT...; LABEL...; KEYLABEL...;
BY ...; FREQ...; WEIGHT...; RUN; |
- PROC TABULATE <options> includes DATA=data set;
DEPTH=number; FORMAT=format name; FC or FORMCHAR<Index list>='string';
MISSING; NOSEPS; ORDER=order; and VARDEF=divisor.
- ORDER=FORMATTED (ordered by the formatted values); ORDER=DATA
(the order that the observations are read from the data set);
ORDER=FREQ (order the values of the class variables so the value
that occurs most frequently in the data set appears first);
ORDER=INTERNAL (ordered by the SORT procedure: defaults)
- The FORMCHAR=a string
of 11 characters. The default is FORMCHAR='|----|+---'; FORMCHAR='
11 blank space '; for no horizontal
and vertical line. FORMCHAR='4FBFACBFBC4F8F4FABBFBB'X; for IMB
1403 printer FORMCHAR='FABFACCCBCEB8FECABCBBB'X; for IBM 6670
printer
FORMCHAR='B3C4DAC2BFC3C5B2C0C1D9'X;
- MISSING considers missing values as valid levels for the
class variables. Unless the Missing option is specified, TABULATE
does not include observations with a missing value for one or
more class variables in the analysis.
- NOSEPS eliminates horizontal separator lines from the row
titles and body of the printed table except the column title
section of the table.
- VARDEF=divisor specifies the divisor to be used in
the calculation of the variances. If divisor is DF(default),
the degrees of Freedom (N-1) is used as the divisor.
- TABLE page expression, row-expression, column
expression /option list;
- /BOX=value specifies the text to be placed in the
empty box above the row titles. It allows you to move the page
heading into the top left box of the table or insert either
a variable name (or label) or a descriptive string in the box.
When BOX=_PAGE_, the page-dimension text appears in
the box. If the page-dimension text does not fit, it is placed
in its default position and the box is left empty. When BOX='sting',
the quoted string appears in the box. Any name, label, or quoted
string that does not fit is truncated. When BOX=variable, the
name or label of a variable appears in the box. (TABLE gender*marital, income
tax /BOX='Income and Tax';).
- /CONDENSE prints multiple logical pages on a single
physical page.
Operator |
Action |
Comma , |
Separates
dimensions of table and crosses elements across dimensions |
Asterisk * |
Crosses
elements within a dimension |
Blank space |
Concatenates
elements in a dimension |
Parentheses () |
Group
elements and associate operator with entire group |
Brackets <> |
Specify
denominator definitions |
Equal sign = |
Assigns
a label to a variable or statistic, or completes a format
modifier |
- /PRINTMISS species that row and column headings
are the same for all logical pages of the table.
- /ROW=spacing specifies whether all title elements
in a row crossing are allotted space even when they are blank.
When ROW=CONSTANT (OR CONST), the default, all row title elements
have space allotted to them; when ROW=FLOAT, the row title space
is divided equally among the nonblank title elements in the
crossing.
- /RTSPACE=number or RTS=number supplies an integer value
that species the number of print positions allotted to the headings
in the row dimension. the default value I on-fourth of the LINESIZE=value.
- Format USAGE in the TABLE statement: variable*F=format (income*F=6.1).
- Remove or change
variable name in the TABLE
statement (TABLE sum*stock=' ';)
- FORMAT variable list formats
the values of class variables used as headings in the page,
row, and column dimensions. It may be used in combination with
the PROC FORMAT; to group values of class variables. This statement
has no effect on either analysis variables or the content of
table cells.
PROC
FORMAT; VALUE
inc_FMT .='Income
Unknown' 0-999='Under
999' 1000-4999='1,000-4,999'
5000-9999='5,000-9,999'
OTHERS='10,000
and Over';
PROC TABULATE
DATA=korea F=12.; CLASS
...; VAR income...; FORMAT
income inc_FMT; TABLE
... RUN; |
- KEYLABEL keyword='label' ...; replace
text in the label anywhere the specified keyword is use, unless
another label is assigned in the TABLE statement (KEYLABEL MEAN='Average';).
- LABEL variable='label'; replaces
the name of the variable in the page, row, or column heading
where the variable appears. The maximum length of the label
(in both KEYLABEL and LABEL)is 40 characters.
- Hierarchical positioning: An Asterisk
'*' between
variables indicate statistics will be listed in a hierarchical
position. On the other hand, a SPACE between variables indicates statistics
of them will be listed in the parallel position, not in a hierarchical
position. For instance, 'TABLE gender marital, income*...' will
add two rows for the marital status below gender.

Frequency Cross
Table (One by One)
PROC
TABULATE; CLASS gender party; TABLE
gender ALL, party*(N PCTN<party>)*(F=4.) ALL*F=7.
/RTS=8; RUN; |
/RTS=8;
It is an option for adjusting
the cell length (Equivalent to /RTSPACE=8;).
You can compare
the above with the following result of FREQ procedure (PROC
FREQ; TABLE gender*party /NOCOL MISSING; RUN;).


ALL and Parentheses
Usage
PROC
TABULATE; CLASS gender marital party;
VAR income;
TABLE gender*marital, ALL*F=5. party*(N*F=5.
PCTN='%'*F=4.1) income*(MEAN*F=5. STD*F=5. PCTSUM<gender*marital>='%'
*F=4.1) /RTS=17 BOX='All Usage';
TABLE gender*marital
ALL*F=5., party*(N*F=5. PCTN='%'*F=4.1)
income*(MEAN*F=5. STD*F=5. PCTSUM<gender*marital ALL>='%' *F=4.1) /RTS=17
BOX='All Usage'; RUN; |

PROC
TABULATE; CLASS gender marital; VAR
income;
TABLE (gender
ALL)*marital*income=''*(MEAN*F=5.
STD*F=4.1) /RTS=10;
TABLE gender*(ALL
marital)*income=''*(MEAN*F=5.
STD*F=4.1) /RTS=10 ; RUN;
|
 

Two by
Two
Table (Two by Two)
PROC
TABULATE; CLASS gender marital party;
VAR income; LABEL income='Annual Income'
marital='Marital Status'; TABLE (gender
ALL)*(party ALL), marital*income*(N MEAN STD)*(F=5.)
ALL*F=7.; RUN; |
PROC
TABULATE; CLASS gender marital; VAR
income tax;
TABLE (gender*marital)*F=5.
(income tax)*(MEAN STD)*F=5. /RTS=17;
TABLE gender*F=5.*(marital*F=5. (income tax)*(MEAN STD)*F=5.) /RTS=17;
RUN; |
1.gif) 2.gif)

Three
Dimensional Table
PROC
TABULATE; CLASS gender marital party religion;
VAR income tax; TABLE gender ALL, marital
party, religion*(income*(MEAN
STD)*F=5.
tax*(MEAN STD)*F=5.)/RTS=10; RUN; |
- marital party Notice that the blank between the variables
causes the parallel arrangement of them. (Different from 'marital*party')
- income*(MEAN STD)*F=5.; The tabulate function allows a variety
of formats, including decimal point position and cell length,
in a table.
- Notice that there are two tables omitted here: one for Gender
M and the other for ALL.

PCTN and
PCTSUM
Usage
PROC
TABULATE; CLASS gender marital; VAR
income tax; TABLE gender*marital ALL,
(income tax)*(SUM*F=5. STD*F=5. PCTSUM<gender*marital
ALL>='%'*F=5.1)/RTS=17; RUN; |
- PCTSUM<gender*marital
ALL>; The percent
sum will be calculated by each cell combination of the gender
and marital status variables.
- ='%'*F=5.1; It specifies the title of the PCTSUM,
'%', and its format of '5.1'.
- For example, 'age='Official Age'*MEAN='Average Age';' allows you to change the names of the
variable and statistics. 'Offician Age' instead of 'age' and
'Average Age' instead of 'MEAN' will be shown in the table.
- Notice that the part of the right side is cut for the convenience's
sake. But it (table for Christianian) has the same fashion as
that of Buddhism.
PROC
TABULATE; CLASS gender marital; VAR
income tax;
TABLE gender
ALL, (marital ALL)*(income
tax)*F=4.*(SUM*F=4. PCTSUM<ALL gender>='%'*F=3.)
/RTS=8;
TABLE ALL gender,
(ALL marital)*(income
tax)*F=4.*(SUM*F=4. PCTSUM<ALL gender>='%'*F=3.)
/RTS=8;
|
 

Complex Table
PROC
TABULATE; CLASS gender marital party;
VAR income tax; TABLE gender*marital
ALL, party
ALL*(income*F=7.
tax*F=6.) /RTS=17 BOX='Default Test';
TABLE gender*marital
ALL, (party
ALL)*F=5.*(income*SUM=''*F=7.
tax*SUM=''*F=6.) /RTS=17 BOX='Sum of Each Cell';
RUN; |
- party ALL; Default statistics of a nominal variable
is N (frequency), so it will produce the number of observations
falling in the each cell. The statement is equivalent to party*N
ALL;
- income*F=7. tax*F=6. ; Default statistics of a ordinal or interval
variable is SUM, so it will produce the sum of observations
falling in the each cell. The statement is equivalent to 'income*SUM*F=7.
tax*SUM*F=6.;'.
- /BOX='Default Test'; specifies the location of the table's
name.
- SUM=' '; removes variable name in the table.
- The following is the result
of adding an option, ...(party
ALL)*F=5. * (income...

Normal Distribution
Table
(Adapted from Kim's book)
DATA
normal; DO
row=0.0 TO 3.4 BY 0.1; DO
column=0.00 TO 0.09 BY 0.01; z=row+column; prob=PROBNORM(z);
OUTPUT; END;
END; RUN;
PROC TABULATE
DATA=normal; CLASS
row column; VAR
prob; TABLE
row, column*prob=''*SUM=''*F=5.4/RTSPACE=5; LABEL
row='Z' column='Standard Normal Distribution';
RUN; |
- Notice that prob='
'*SUM=' '; removes variable
names prob and SUM from the table, otherwise they would appear
on the top of the table.

Student T Distribution
Table
(Adapted from Kim's book)
DATA
student; DO
df=1 TO 30 BY 1; DO
prob=.80, .85, .90, .925, .95, .975, .99;
t=TINV(prob,
df); OUTPUT; END;
END; RUN;
PROC TABULATE; CLASS df
prob; VAR
t; TABLE
df, prob*t=''*MEAN=''*F=7.4/RTSPACE=7 /BOX='T';
LABEL prob='Probability'; RUN; |

 |
|
|