Text File | 1987-05-23 | 19KB | 378 lines
### DataDoctor ###
DataDoctor started out as a simple utility to insert tabs between the
observations of a data set or to replace tabs with spaces. But I found that
in the process of reading and writing a data file, the observations were
converted to a form that allowed selection and modification. As a result,
DataDoctor grew into a major tool for data editing and manipulation.
Along with its original Tab function, DataDoctor allows you to modify
your data in a number of ways. You can request that only certain columns be
written to your output file. Or you can specify that only those rows be
written for which a decision variable falls within a specified range. You can
add dummy variables, a serial index or a constant variable; you can designate
certain columns as row labels; and you can replace all occurrences of one value
with another. Furthermore, you can do all this to data files that are too
large to be handled in any other way.
GETTING STARTED:
DataDoctor will read any ASCII (text) file regardless of what's in it.
That means you can tell it to open up a data set or the rough draft of your
first novel. It's really only useful with the data set, however.
There is no limitation on file size, so you can read/write a 500k data
set as easily as a 5k set. It just takes longer. The only real limitation is
on number of variables. At this time, DataDoctor is restricted to 18 original
variables and 4 'extra' ones. Your data set can have more variables than
this, but only the first 18 (+4) will be processed.
After you choose an input file, DataDoctor will ask for an output file.
Make sure that there is plenty of room on the output disk. Once
reading/writing operations have begun, the program checks to make sure there
is always at least 5k of disk space. It will stop and close your files if
this limit is reached.
After you designate input and output files, a full-screen dialog window
appears. The top half is a variable list, with check boxes, variable numbers
and variable names. If your data set has a 'column heading' row, then the
variable names will be just what you want. If your data set starts
immediately with rows of observations, then the variable names will be the
observations themselves. If your data set starts out with a blank line, a
title, or other information, then either spurious variable names will appear or
there will be no variable names at all. Click the 'Read Another Line' button
until you get to either a heading row or to a real row of data.
Select the columns you want written to your output file by clicking the
check box beside the variable you want. If you want all variables, click the
'Select All' button. (You can also 'unselect all' with this button.)
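
The column selection described above can be sketched in a few lines of Python. This is only an illustration, not DataDoctor's own code (the program was written in Pascal); the name `select_columns`, the whitespace splitting and the `MAX_ORIGINAL` constant are assumptions made for the example.

```python
MAX_ORIGINAL = 18  # limit on original variables stated in this document

def select_columns(line, selected):
    """Return the observations whose 1-based variable numbers are checked."""
    fields = line.split()[:MAX_ORIGINAL]  # assumption: whitespace-separated input
    return [fields[n - 1] for n in selected if 1 <= n <= len(fields)]

print(select_columns("ABC 4.56 789 876", [1, 3]))  # ['ABC', '789']
```

Variable numbers beyond the limit are simply skipped here, which mirrors the statement that only the first 18 original variables are processed.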
Note that DataDoctor is quite dependent on the number and placement of
variables in the variable list. It uses this heading information throughout
the program to keep track of which columns you have selected and to test your
decision variable or function. So be very careful that you keep skipping the
lead-in lines of a data set until you have come to a row that is fully
representational. (If you overshoot, unfortunately you can't back up, but you
can start over at the top of the input file by clicking 'Start Over'.)
ADDING EXTRA VARIABLES:
If your data set includes columns that do not have a heading (or do not
have an observation in the row you designate as the variable list), you can
add up to three 'extra' trailing variables. Here's an example:
Stock    Price    Volume   Shares
ABC      4.56     789      876
DEF      3.45     678      901
...      ...      ...      ...
...      ...      ...      ...
GHI      6.78     901      234      .111
JKL      9.01     234      567
MNO      2.34     567      890      .222
(If you are not reading this in 9 pt Monaco, the columns in this and the
following examples will not 'line up'. You may want to switch fonts before
continuing.) The last column here does not have a column heading. Skipping
rows until you get to a representative row will waste data since the first
observation does not appear for a number of rows. This is exactly the case in
which you would want to add an extra variable. If you don't add the extra
variable to the variable list, the last column of observations will never be
read.
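
To make the effect of a trailing extra variable concrete, here is a rough Python sketch under stated assumptions: rows are split on whitespace, the row width is fixed by the variable list plus any extras, and missing trailing observations are padded with empty strings. The name `read_row` is hypothetical and is not taken from DataDoctor.

```python
def read_row(line, n_headed, n_extra):
    """Split a row into exactly n_headed + n_extra observations."""
    width = n_headed + n_extra
    fields = line.split()[:width]                  # anything past the width is never read
    return fields + [""] * (width - len(fields))   # assumption: blanks padded with ''

row = "GHI 6.78 901 234 .111"
print(read_row(row, 4, 0))  # ['GHI', '6.78', '901', '234']
print(read_row(row, 4, 1))  # ['GHI', '6.78', '901', '234', '.111']
```

Without the extra variable the '.111' observation is dropped; with it, every row is read five columns wide.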
Extra variables are automatically added to the _end_ of each row, so if
your unlabeled observations are in middle columns, you will need to do some
renaming of column headings.
Many data sets come with labels in the first column. If this 'label'
column doesn't have a heading in the variable list, then all of your data will
be misread. Here's an example:
         Price    Volume   Shares   Divs
ABC      4.56     789      876
DEF      3.45     678      901
...      ...      ...      ...
...      ...      ...      ...
GHI      6.78     901      234      .111
JKL      9.01     234      567
MNO      2.34     567      890      .222
If you don't correct this, DataDoctor will read your rows like this:
Price    Volume   Shares   Divs
ABC      4.56     789      876
DEF      3.45     678      901
...      ...      ...      ...
...      ...      ...      ...
GHI      6.78     901      234      .111
JKL      9.01     234      567
MNO      2.34     567      890      .222
In this case, you could simply skip a row to get to one that has a 'Stock'
observation, but you would then lack a 'Divs' column and would either lack
descriptive column headings or, if you replaced the observations with variable
names, would lose a row of data. To deal with this situation, DataDoctor
allows you to add a leading variable to the variable list. (Sorry, only one
is provided for.) This moves all the subsequent headings over by one
position.
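
The heading shift can be illustrated with a short Python sketch. The function `add_leading` and the heading name 'Stock' are assumptions chosen for the example; the point is only that prepending one heading realigns every later heading with its data.

```python
def add_leading(headings, name="Stock"):
    """Prepend one leading variable, shifting all headings right by one."""
    return [name] + headings

headings = ["Price", "Volume", "Shares", "Divs"]
row = "GHI 6.78 901 234 .111".split()

misread = dict(zip(headings, row))              # 'Price' is paired with 'GHI'
fixed = dict(zip(add_leading(headings), row))   # 'Price' is paired with '6.78'
print(misread["Price"], fixed["Price"], fixed["Divs"])  # GHI 6.78 .111
```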
If your data has observations that include separate words, then each
word will be considered an observation, which is something you may not want.
To get around this, DataDoctor allows you to specify label columns. The
observations in the label columns are written to the end or beginning of a row
with no spaces separating the labels. For example, suppose your variable list
looked like this:
A B C 4.56 789 876
The name of the stock is ABC, but this variable list is going to generate 6
variables: 3 for the stock name and 3 for the other observations.
Furthermore, the stock name will end up being written as: A <tab> B <tab> C
if you choose tab delimiters. To avoid this, you can designate up to 3
columns as labels. Above, you would choose variable numbers 1, 2, and 3 to
be labels. You may then want to deselect variables 1, 2, and 3 from the
variable list - otherwise they will appear twice: once as a variable, then
again as part of a label. (A column can be both a regular variable _and_ a
label simultaneously.) If you do this with the above data and choose trailing
labels, your rows will be written as follows:
4.56 789 876 ABC
You can choose whether the labels are leading (at the beginning of a
row) or trailing (at the end of the row). For spreadsheet use, you will
probably want leading labels. For many statistics programs, trailing labels
are necessary.
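
As a sketch of the label handling described above (written in Python for illustration, with the hypothetical name `write_row`), the label columns are concatenated without separators and placed at the start or end of the output row:

```python
def write_row(fields, selected, labels, trailing=True, sep=" "):
    """Join label columns into one word and attach it to the selected data."""
    label = "".join(fields[n - 1] for n in labels)   # no spaces between label parts
    data = [fields[n - 1] for n in selected]          # 1-based variable numbers
    out = data + [label] if trailing else [label] + data
    return sep.join(out)

fields = "A B C 4.56 789 876".split()
print(write_row(fields, selected=[4, 5, 6], labels=[1, 2, 3]))                  # 4.56 789 876 ABC
print(write_row(fields, selected=[4, 5, 6], labels=[1, 2, 3], trailing=False))  # ABC 4.56 789 876
```

A single space is used as the separator here only to keep the printed result readable; the program's actual delimiters are described under OTHER UTILITIES.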
CHOOSING A DECISION VARIABLE:
DataDoctor allows you to select a range of data to be written to your
output file. To do this, first determine which variable you want to be your
decision variable. Make sure the 'Decision Variable' box is selected, then
click on the name of the variable (not the number and not the check box) in
the variable list. For example, in a set of stock prices, to choose
'Dividend' as a decision variable just click on the variable name 'Dividend'.
The variable name will appear in the highlighted decision variable box and its
variable number will appear above it. If you change your mind, just click on
another variable. If you want no decision variable, deselect the decision
variable check box.
Next set the range for your decision variable. Type in a number in the
lower bound box and another in the upper bound. Note that your decision
variable and its range _must_ be numeric. You can't have a decision variable
containing pet names with a range of 'cat' to 'dog'. If dividend yield is
your decision variable, you might type in .03 for the lower bound and .08 for
the upper bound. You cannot enter 3% and 8% because '%' is not numeric. In
fact, you could not use 'dividend yield' as a decision variable if the yields
were expressed in '%'s. (It's unfortunate but true that '$' is also not
numeric, so you can't use a decision variable that is expressed in $ amounts.
Exponential notation is also not numeric. If you have floating point data
expressed as, say, 9.24E02, then this variable cannot be used as a decision
variable.)
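
The numeric restriction can be approximated as follows. This Python sketch assumes that 'numeric' means an optional sign, digits and at most one decimal point, and that the bounds are inclusive; neither detail is spelled out in this document, and the names `plain_number` and `in_range` are hypothetical.

```python
import re

# Assumption: plain decimal notation only (no '%', '$' or exponent).
_PLAIN = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")

def plain_number(text):
    """Return the value as a float, or None if the text is not plain numeric."""
    return float(text) if _PLAIN.fullmatch(text) else None

def in_range(text, lower, upper):
    """True when the observation is numeric and lies within the bounds."""
    value = plain_number(text)
    return value is not None and lower <= value <= upper

for obs in [".05", "5%", "$5", "9.24E02"]:
    print(obs, in_range(obs, 0.03, 0.08))
```

Only '.05' passes; the other three observations are rejected because they are not plain numbers, whatever the bounds.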
DEFINING A DECISION FUNCTION:
Instead of choosing a single variable as a decision variable, you can
define a decision function. Click the Decision Function button and a window
will appear. Enter a variable number for both the X and Y variables. (Both
must be designated. If you skip one, an alert will remind you.) Then enter a
numeric value for each of the constants. Finally, determine what operations
you want done. For example, you might choose X as 'dividend-yld', Y as
'capital_gain', c1 as 0.0 and c2 as 0.0. Then (X + c1) + (Y + c2) will give
you dividend yield plus capital gain, i.e. the total return on a stock. This
'Total Return' is now your decision variable. Only rows of stock observations
that have a 'total return' between the specified upper and lower bounds will
be written to your output file.
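
A decision function of the form (X op c1) op (Y op c2) might be sketched like this in Python. The set of available operations is an assumption (this document only shows addition), as are the names `decision_value` and `OPS`; the guard against division by zero reflects the kind of error the caveat below warns about.

```python
import operator

# Assumption: the four basic arithmetic operations are available.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def decision_value(x, y, c1=0.0, c2=0.0, op_x="+", op_y="+", outer="+"):
    """Evaluate (x op_x c1) outer (y op_y c2); return None on division by zero."""
    try:
        return OPS[outer](OPS[op_x](x, c1), OPS[op_y](y, c2))
    except ZeroDivisionError:
        return None

total_return = decision_value(0.04, 0.06)   # dividend yield + capital gain
print(0.05 <= total_return <= 0.15)         # True: this row would be written
```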
There are two important things to note about the Decision Function.
First, it slows down output by quite a bit. This isn't important for small
files but may become very boring with big files. Secondly, DataDoctor has
only limited error checking abilities. A complex decision function could
possibly crash the program if irrational values are generated by the function.
ADDING AN INDEX OR CONSTANT:
If you click the Add Variables button, a dialog will appear asking you to
identify the variables you wish to add. Four of these were discussed already,
i.e. the leading and trailing extra variables. But in addition, you may add
an index variable and/or a constant variable. Either of these can be made
into labels if you prefer.
The index can be used as a very versatile counter. Note that you can
set its starting point and increments. The constant variable may not seem
particularly useful at first, but it can be very valuable for tagging sorted
data sets and creating dummy variables. For example, you can output a
selected set of data tagged with, say, 1 and merge this with another selected
set of data tagged with 0. You now have a dummy variable (a column of 1's and
0's) for a regression analysis.
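
The dummy-variable trick can be shown with a small Python sketch. The function `tag_rows` is hypothetical, and appending the index and constant at the end of each row is an assumption made for the example.

```python
def tag_rows(rows, constant, start=1, step=1):
    """Append an index (start, start+step, ...) and a constant to each row."""
    return [row + [str(start + i * step), str(constant)]
            for i, row in enumerate(rows)]

selected = tag_rows([["ABC", "4.56"], ["DEF", "3.45"]], constant=1)
others = tag_rows([["GHI", "6.78"]], constant=0)

# Merging the two outputs yields a column of 1's and 0's for regression.
for row in selected + others:
    print(row)
```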
Any of the Extra Variables can be used as a decision variable, although
I'm not sure why anyone would want to use the Constant. The other variables,
however, can be very useful. Using the Index as a decision variable, you can
specify that only a certain number of rows be output. Using one of the blank
Extras, you can specify that only those rows for which that variable has a value
will be output. By the way, since the decision variable must be numeric,
blank Extra variables are translated to '0' and output that way. So if you
want to change blank observations to '0', make the column involved a decision
variable with -1 as a lower bound.
CHANGE NAMES:
This button calls up a window that does just what it sounds like. It
lets you edit your column headings as you wish. Only about 12 characters of
the name will appear in the opening window's variable list, but the whole name
will be written to your output file.
If you have a data set with no headings and you want to create a set,
skip lines until you get to the first true row of your data. Then use Change
Names to change each observation value to a variable name. This row then
becomes your heading. Be aware that this destroys the first row of data - so
only do this if you have data to spare.
Note that you _cannot_ create variables with Change Names. The dialog
will be perfectly happy to let you type in extra names, and the main window
will display these names, but you cannot select them as variables because they
actually don't exist.
Nor can you remove variables with Change Names. If you delete a name,
the variable position still exists but is nameless.
REPLACE VALUES:
DataDoctor is very lenient about what form observations take. The only
time the actual type and value of the observation matters is for the decision
variable or function. But your statistics or charting program may feel
otherwise. If your data is filled with periodic 'n.a.'s to represent missing
values, you may want to use Replace Values to change all of these 'n.a.'s to,
say, -999.
The Replace Values function is _very_ limited and persnickety. It
looks only at whole observations, not parts. So it will only replace a value
if the whole observation exactly equals what you have typed into the Replace
Values window. For example, if you request that 'Sept' be replaced, an
observation of 'sept' will not be replaced. Be very careful about how you
type in your request. If you typed in 'n.a._' or '_n.a.' (with a leading or
trailing blank) then only 'n.a.' observations with this leading or trailing
blank will be replaced.
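
The whole-observation, exact-match behavior described above amounts to a plain equality test. A minimal Python sketch, with the hypothetical name `replace_values`:

```python
def replace_values(fields, old, new):
    """Replace observations that equal `old` exactly (case and blanks matter)."""
    return [new if field == old else field for field in fields]

row = ["Sept", "sept", "n.a.", "n.a. "]
print(replace_values(row, "n.a.", "-999"))  # ['Sept', 'sept', '-999', 'n.a. ']
print(replace_values(row, "Sept", "9"))     # ['9', 'sept', 'n.a.', 'n.a. ']
```

Neither 'sept' nor the 'n.a.' with a trailing blank is touched, because neither is identical to the requested value.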
OTHER UTILITIES:
No LFs versus Add LFs - Mainframes and the Macintosh seem to have
very different ideas about linefeeds. If you have downloaded a
linefeed-filled dataset (the linefeeds appear as little boxes), you may want to get
rid of them. If you are uploading a dataset, some computers require that you
add linefeeds; otherwise all the data tries to cram itself into one line.
DataDoctor will always remove linefeeds (it basically just ignores them)
unless you tell it to do otherwise.
Tabs versus Spaces - DataDoctor has to separate your observations
with something. If you choose spaces, it will insert 3 spaces between each
observation. If you choose Tabs, it will insert 1 tab and 2 spaces between
each observation. Please let me know if you would like other delimiters, like
commas, semicolons, etc.
Please note that DataDoctor is not designed to turn your data into a
'report quality' formatted table. Your output data will not necessarily appear
in nice straight columns. The intent of the program is to convert your data
to a form that will be easily read by other programs, not other humans.
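
The delimiter and linefeed options in this section can be sketched as below. Several details are assumptions rather than facts from this document: that the tab comes before the two spaces, that each output row ends with a carriage return (the usual Macintosh line ending of the period), and that 'Add LFs' appends a linefeed after it. The name `format_row` is hypothetical.

```python
def format_row(fields, use_tabs=True, add_lf=False):
    """Join observations with the chosen delimiter and line ending."""
    sep = "\t  " if use_tabs else "   "   # 1 tab + 2 spaces, or 3 spaces
    end = "\r\n" if add_lf else "\r"      # assumption: CR ends a row, LF is optional
    return sep.join(fields) + end

print(repr(format_row(["ABC", "4.56"])))                               # 'ABC\t  4.56\r'
print(repr(format_row(["ABC", "4.56"], use_tabs=False, add_lf=True)))  # 'ABC   4.56\r\n'
```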
INPUT/OUTPUT:
Once you have made all the selections you need in the opening window,
click OK. DataDoctor will begin processing your data immediately. While it's
doing this, an Input/Output Progress window will show what's happening. You
can pause the program at any time by clicking Pause and can view the current
output record by clicking View. (You should do this at least once to make
sure the output set looks like what you intended. Did you select the right
variables? Is your Replace request working?)
You can quit at any time by clicking Quit. Any data written to the
output file at that point will be preserved. Both your input and output files
will be closed and you will exit the program. Obviously this is something you
will want to do if a large data set is being read past a specified range. If
nothing more is being written to the output file, you may as well quit.
When the entire input file has been processed, the window will notify
you with 'Finished'. Click Quit and that's that. Note that you do have to
click the Quit button, even though all input/output has been completed. The
program does _not_ exit automatically. There are two reasons for this.
First, you can leave the room, have some coffee, read a book, etc. while the
program is working. When you come back, you will see exactly how many rows of
data were read and how many were actually written. This may be very useful
information. Second, you may wish to append a second set of selected data to
the first set in the output file.
Clicking Append closes the input file but not the output file. A
dialog will ask you for a new input file. (You can choose the same one if you
wish.) The opening window will then appear allowing you to make whatever
selections you need. You can redefine your decision function, add variables
or do whatever else you wish. Clicking OK brings you back to the Input/Output
Window where you will see (from the 'write' counter) that the new data set is
being added on to the original output file.
SOME OBSERVATIONS AND CAVEATS:
DataDoctor is my first attempt at a full-blown Macintosh application.
Writing it was very much a learning process. As a result, there are a number
of things lacking in the program.
*** First and Foremost *** DataDoctor has only limited error
checking. I have tried to provide traps for obvious errors, but will
certainly not have trapped every possibility. The program may well crash if
pressed too hard. So _please_ make sure your data is backed up before running
the program!
DataDoctor runs fine on my MacPlus with a hard disk in a fairly
complicated environment (i.e. with inits, fkeys, too many DA's, etc.). But I
have no idea how it will do on other Macs. It uses up a lot of memory space,
so it may not work on a 128k Mac.
Things I would like to add to DataDoctor include:
-menus and access to DA's
-speed (I know it's slow.)
-ability to handle more than 18 variables.
-sorting ability
-full window views on the input and output files.
If I can get speed up and size down, DataDoctor would be particularly useful
converted to DA form.
If there's a utility or ability you really want added, let me know.
And if you have problems, crashes or encounter general weirdnesses, please let
me know about that too.
DataDoctor was developed using Lightspeed Pascal, which is an
absolutely wonderful programming environment. Many, many thanks to Think
Technologies!
Patricia Smith
325 E. 79th St.
NY,NY 10021
CIS: 70655,425