Carousel Volume 2 #1

Text File | 1987-05-23 | 19KB | 378 lines

### DataDoctor ###

DataDoctor started out as a simple utility to insert tabs between the observations of a data set or to replace tabs with spaces. But I found that in the process of reading and writing a data file, the observations were converted to a form that allowed selection and modification. As a result, DataDoctor grew into a major tool for data editing and manipulation.

Along with its original Tab function, DataDoctor allows you to modify your data in a number of ways. You can request that only certain columns be written to your output file. Or you can specify that only those rows be written for which a decision variable falls within a specified range. You can add dummy variables, a serial index or a constant variable; you can designate certain columns as row labels; and you can replace all occurrences of one value with another. Furthermore, you can do all this to data files that are too large to be handled in any other way.

GETTING STARTED:
DataDoctor will read any ASCII (text) file regardless of what's in it. That means you can tell it to open up a data set or the rough draft of your first novel. It's really only useful with the data set, however. There is no limitation on file size, so you can read/write a 500k data set as easily as a 5k set. It just takes longer. The only real limitation is on the number of variables. At this time, DataDoctor is restricted to 18 original variables and 4 'extra' ones. Your data set can have more variables than this, but only the first 18 (+4) will be processed.

After you choose an input file, DataDoctor will ask for an output file. Make sure that there is plenty of room on the output disk. Once reading/writing operations have begun, the program checks to make sure there is always at least 5k of free disk space. It will stop and close your files if this limit is reached.

After you designate the input and output files, a full screen dialog window appears.
The top half is a variable list, with check boxes, variable numbers and variable names. If your data set has a 'column heading' row, then the variable names will be just what you want. If your data set starts immediately with rows of observations, then the variable names will be the observations themselves. If your data set starts out with a blank line, a title, or other information, then either spurious variable names will appear or there will be no variable names at all. Click the 'Read Another Line' button until you get to either a heading row or a real row of data.

Select the columns you want written to your output file by clicking the check box beside each variable you want. If you want all variables, click the 'Select All' button. (You can also 'unselect all' with this button.)

Note that DataDoctor is quite dependent on the number and placement of variables in the variable list. It uses this heading information throughout the program to keep track of which columns you have selected and to test your decision variable or function. So be very careful to keep skipping the lead-in lines of a data set until you come to a row that is fully representative. (If you overshoot, unfortunately you can't back up, but you can start over at the top of the input file by clicking 'Start Over'.)

ADDING EXTRA VARIABLES:
If your data set includes columns that do not have a heading (or do not have an observation in the row you designate as the variable list), you can add up to three 'extra' trailing variables. Here's an example:

   Stock   Price   Volume  Shares
   ABC     4.56    789     876
   DEF     3.45    678     901
   ...     ...     ...     ...
   ...     ...     ...     ...
   GHI     6.78    901     234     .111
   JKL     9.01    234     567
   MNO     2.34    567     890     .222

(If you are not reading this in 9 pt Monaco, the columns in this and the following examples will not 'line up'. You may want to switch fonts before continuing.)

The last column here does not have a column heading.
Skipping rows until you get to a representative row will waste data, since the first observation in that column does not appear for a number of rows. This is exactly the case in which you would want to add an extra variable. If you don't add the extra variable to the variable list, the last column of observations will never be read. Extra variables are automatically added to the _end_ of each row, so if your unlabeled observations are in middle columns, you will need to do some renaming of column headings.

Many data sets come with labels in the first column. If this 'label' column doesn't have a heading in the variable list, then all of your data will be misread. Here's an example:

           Price   Volume  Shares  Divs
   ABC     4.56    789     876
   DEF     3.45    678     901
   ...     ...     ...     ...
   ...     ...     ...     ...
   GHI     6.78    901     234     .111
   JKL     9.01    234     567
   MNO     2.34    567     890     .222

If you don't correct this, DataDoctor will read your rows like this:

   Price   Volume  Shares  Divs
   ABC     4.56    789     876
   DEF     3.45    678     901
   ...     ...     ...     ...
   ...     ...     ...     ...
   GHI     6.78    901     234     .111
   JKL     9.01    234     567
   MNO     2.34    567     890     .222

In this case, you could simply skip a row to get to one that has a 'Stock' observation, but you would then lack a 'Divs' column and would either lack descriptive column headings or, if you replaced the observations with variable names, would lose a row of data. To deal with this situation, DataDoctor allows you to add a leading variable to the variable list. (Sorry, only one is provided for.) This moves all the subsequent headings over by one position.

If your data has observations that include separate words, then each word will be considered an observation, which is something you may not want. To get around this, DataDoctor allows you to specify label columns. The observations in the label columns are written to the end or beginning of a row with no spaces separating the labels.
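
As a rough illustration of what label columns do, here is a minimal sketch in Python (it is not DataDoctor's own code). The function name write_row, its parameters, and the sample row are all hypothetical. The sketch assumes that observations are split on whitespace, that columns are numbered from 1, and that three spaces are used as the delimiter.

```python
def write_row(tokens, label_cols, selected_cols, trailing=True, delimiter="   "):
    """Build one output row from a list of observations.

    tokens        -- observations from one input row, already split on whitespace
    label_cols    -- 1-based column numbers to join into a single label
    selected_cols -- 1-based column numbers to write as ordinary variables
    trailing      -- True puts the label at the end of the row, False at the start
    """
    # Label pieces are joined with nothing between them.
    label = "".join(tokens[i - 1] for i in label_cols)
    fields = [tokens[i - 1] for i in selected_cols]
    if trailing:
        fields.append(label)
    else:
        fields.insert(0, label)
    return delimiter.join(fields)


# Hypothetical row: a three-word label followed by two numeric observations.
row = "X Y Z 1.23 456".split()
print(write_row(row, label_cols=[1, 2, 3], selected_cols=[4, 5]))
print(write_row(row, label_cols=[1, 2, 3], selected_cols=[4, 5], trailing=False))
```

Columns 1 to 3 are left out of selected_cols here so that the label pieces are not also written as separate variables.
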
For example, suppose your variable list looked like this:

   A B C 4.56 789 876

The name of the stock is ABC, but this variable list is going to generate 6 variables: 3 for the stock name and 3 for the other observations. Furthermore, the stock name will end up being written as:

   A <tab> B <tab> C

if you choose tab delimiters. To avoid this, you can designate up to 3 columns as labels. Above, you would choose variable numbers 1, 2, and 3 to be labels. You may then want to deselect variables 1, 2, and 3 from the variable list - otherwise they will appear twice: once as a variable, then again as part of a label. (A column can be both a regular variable _and_ a label simultaneously.) If you do this with the above data and choose trailing labels, your rows will be written as follows:

   4.56 789 876 ABC

You can choose whether the labels are leading (at the beginning of a row) or trailing (at the end of the row). For spreadsheet use, you will probably want leading labels. For many statistics programs, trailing labels are necessary.

CHOOSING A DECISION VARIABLE:
DataDoctor allows you to select a range of data to be written to your output file. To do this, first determine which variable you want to be your decision variable. Make sure the 'Decision Variable' box is selected, then click on the name of the variable (not the number and not the check box) in the variable list. For example, in a set of stock prices, to choose 'Dividend' as a decision variable just click on the variable name 'Dividend'. The variable name will appear in the highlighted decision variable box and its variable number will appear above it. If you change your mind, just click on another variable. If you want no decision variable, deselect the decision variable check box.

Next, set the range for your decision variable. Type a number in the lower bound box and another in the upper bound box. Note that your decision variable and its range _must_ be numeric.
You can't have a decision variable containing pet names with a range of 'cat' to 'dog'. If dividend yield is your decision variable, you might type in .03 for the lower bound and .08 for the upper bound. You cannot enter 3% and 8% because '%' is not numeric. In fact, you could not use 'dividend yield' as a decision variable if the yields were expressed in '%'s. (It's unfortunate but true that '$' is also not numeric, so you can't use a decision variable that is expressed in $ amounts. Exponential notation is also not numeric. If you have floating point data expressed as, say, 9.24E02, then this variable cannot be used as a decision variable.)

DEFINING A DECISION FUNCTION:
Instead of choosing a single variable as a decision variable, you can define a decision function. Click the Decision Function button and a window will appear. Enter a variable number for both the X and Y variables. (Both must be designated. If you skip one, an alert will remind you.) Then enter a numeric value for each of the constants. Finally, determine what operations you want done.

For example, you might choose X as 'dividend-yld', Y as 'capital_gain', c1 as 0.0 and c2 as 0.0. Then (X + c1) + (Y + c2) will give you dividend yield plus capital gain, i.e. the total return on a stock. This 'Total Return' is now your decision variable. Only rows of stock observations that have a 'total return' between the specified upper and lower bounds will be written to your output file.

There are two important things to note about the Decision Function. First, it slows down output by quite a bit. This isn't important for small files but may become very boring with big files. Second, DataDoctor has only limited error checking abilities. A complex decision function could possibly crash the program if irrational values are generated by the function.

ADDING AN INDEX OR CONSTANT:
If you click the Add Variables button, a dialog will appear asking you to identify the variables you wish to add.
Four of these were discussed already, i.e. the leading and trailing extra variables. But in addition, you may add an index variable and/or a constant variable. Either of these can be made into labels if you prefer.

The index can be used as a very versatile counter. Note that you can set its starting point and increments. The constant variable may not seem particularly useful at first, but it can be very valuable for tagging sorted data sets and creating dummy variables. For example, you can output a selected set of data tagged with, say, 1 and merge this with another selected set of data tagged with 0. You now have a dummy variable (a column of 1's and 0's) for a regression analysis.

Any of the Extra Variables can be used as a decision variable, although I'm not sure why anyone would want to use the Constant. The other variables, however, can be very useful. Using the Index as a decision variable, you can specify that only a certain number of rows be output. Using one of the blank Extras, you can specify that only those rows for which that variable has a value will be output. By the way, since the decision variable must be numeric, blank Extra variables are translated to '0' and output that way. So if you want to change blank observations to '0', make the column involved a decision variable with -1 as a lower bound.

CHANGE NAMES:
This button calls up a window that does just what it sounds like. It lets you edit your column headings as you wish. Only about 12 characters of the name will appear in the opening window's variable list, but the whole name will be written to your output file.

If you have a data set with no headings and you want to create a set, skip lines until you get to the first true row of your data. Then use Change Names to change each observation value to a variable name. This row then becomes your heading. Be aware that this destroys the first row of data - so only do this if you have data to spare.
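
Returning briefly to the index and constant variables described under ADDING AN INDEX OR CONSTANT, the dummy-variable trick can be sketched in Python as follows. This is only an illustration, not DataDoctor's own code. The function name add_extras and the split of the rows into 'high' and 'low' sets are hypothetical, and the sketch assumes that the index and the constant are appended at the end of each row, which the text above does not spell out.

```python
def add_extras(rows, start=1, step=1, constant=None):
    """Append an index, and optionally a constant, to every row.

    rows     -- list of rows, each a list of observation strings
    start    -- first index value
    step     -- increment between successive index values
    constant -- value appended to every row, or None for no constant
    """
    out = []
    for n, row in enumerate(rows):
        new_row = list(row) + [str(start + n * step)]
        if constant is not None:
            new_row.append(str(constant))
        out.append(new_row)
    return out


# Two hypothetical selections from the same data set.
high = [["GHI", "6.78"], ["JKL", "9.01"]]
low = [["MNO", "2.34"]]

# Tag one set with 1 and the other with 0, then merge them.
merged = add_extras(high, constant=1) + add_extras(low, start=3, constant=0)
for row in merged:
    print("   ".join(row))
```

The last column of the merged rows is the dummy variable (a column of 1's and 0's), and the column before it is the index.
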
Note that you _cannot_ create variables with Change Names. The dialog will be perfectly happy to let you type in extra names, and the main window will display these names, but you cannot select them as variables because they don't actually exist. Nor can you remove variables with Change Names. If you delete a name, the variable position still exists but is nameless.

REPLACE VALUES:
DataDoctor is very lenient about what form observations take. The only time the actual type and value of the observation matters is for the decision variable or function. But your statistics or charting program may feel otherwise. If your data is filled with periodic 'n.a.'s to represent missing values, you may want to use Replace Values to change all of these 'n.a.'s to, say, -999.

The Replace Values function is _very_ limited and persnickety. It looks only at whole observations, not parts. So it will only replace a value if the whole observation exactly equals what you have typed into the Replace Values window. For example, if you request that 'Sept' be replaced, an observation of 'sept' will not be replaced. Be very careful about how you type in your request. If you typed in 'n.a._' or '_n.a.' (where '_' stands for a trailing or leading blank), then only 'n.a.' observations with this trailing or leading blank will be replaced.

OTHER UTILITIES:
No LFs versus Add LFs - Mainframes and the Macintosh seem to have very different ideas about linefeeds. If you have downloaded a linefeed-filled data set (the linefeeds appear as little boxes), you may want to get rid of them. If you are uploading a data set, some computers require that you add linefeeds; otherwise all the data tries to cram itself into one line. DataDoctor will always remove linefeeds (it basically just ignores them) unless you tell it to do otherwise.

Tabs versus Spaces - DataDoctor has to separate your observations with something. If you choose spaces, it will insert 3 spaces between each observation.
If you choose Tabs, it will insert 1 tab and 2 spaces between each observation. Please let me know if you would like other delimiters, like commas, semicolons, etc.

Please note that DataDoctor is not designed to turn your data into a 'report quality' formatted table. Your output data will not necessarily appear in nice straight columns. The intent of the program is to convert your data to a form that will be easily read by other programs, not other humans.

INPUT/OUTPUT:
Once you have made all the selections you need in the opening window, click OK. DataDoctor will begin processing your data immediately. While it's doing this, an Input/Output Progress window will show what's happening. You can pause the program at any time by clicking Pause and can view the current output record by clicking View. (You should do this at least once to make sure the output set looks like what you intended. Did you select the right variables? Is your Replace request working?)

You can quit at any time by clicking Quit. Any data written to the output file at that point will be preserved. Both your input and output files will be closed and you will exit the program. Obviously this is something you will want to do if a large data set is being read past a specified range. If nothing more is being written to the output file, you may as well quit.

When the entire input file has been processed, the window will notify you with 'Finished'. Click Quit and that's that. Note that you do have to click the Quit button, even though all input/output has been completed. The program does _not_ exit automatically. There are two reasons for this. First, you can leave the room, have some coffee, read a book, etc. while the program is working. When you come back, you will see exactly how many rows of data were read and how many were actually written. This may be very useful information. Second, you may wish to append a second set of selected data to the first set in the output file.
Clicking Append closes the input file but not the output file. A dialog will ask you for a new input file. (You can choose the same one if you wish.) The opening window will then appear, allowing you to make whatever selections you need. You can redefine your decision function, add variables or do whatever else you wish. Clicking OK brings you back to the Input/Output window, where you will see (from the 'write' counter) that the new data set is being added on to the original output file.

SOME OBSERVATIONS AND CAVEATS
DataDoctor is my first attempt at a full blown Macintosh application. Writing it was very much a learning process. As a result, there are a number of things lacking in the program.

*** First and Foremost ***
DataDoctor has only limited error checking. I have tried to provide traps for obvious errors, but will certainly not have trapped every possibility. The program may well crash if pressed too hard. So _please_ make sure your data is backed up before running the program!

DataDoctor runs fine on my MacPlus with a hard disk in a fairly complicated environment (i.e. with inits, fkeys, too many DA's etc.). But I have no idea how it will do on other Macs. It uses up a lot of memory space, so it may not work on a 128k Mac.

Things I would like to add to DataDoctor include:
   - menus and access to DA's
   - speed (I know it's slow.)
   - ability to handle more than 18 variables
   - sorting ability
   - full window views on the input and output files

If I can get speed up and size down, DataDoctor would be particularly useful converted to DA form.

If there's a utility or ability you really want added, let me know. And if you have problems, crashes or encounter general weirdnesses, please let me know about that too.

DataDoctor was developed using Lightspeed Pascal, which is an absolutely wonderful programming environment. Many, many thanks to Think Technologies!

Patricia Smith
325 E. 79th St.
NY, NY 10021
CIS: 70655,425