BED - Batch EDitor VERSION 1.1
A data reformatting utility
(c) 1985 by Ken Goosens
Notice: this program is distributed free. You are free to use and
distribute it provided:
(1) no fee or other consideration is charged.
(2) the program is distributed only in unmodified form.
(3) you accept all responsibility for using this program. The
author does not provide any guarantee that this program works
properly and assumes no liability for use of it.
This program is supported by its author. Please send any comments or
enhancements to:
Ken Goosens
5020 Portsmouth Road
Fairfax, VA 22032
Or call Ken's bulletin board system at 202-537-7407 to leave a message
or download the latest version.
A complete set of files consists of
BED.DOC - this documentation
BED.EXE - compiled, executable code
BED.BAS - main program
BEDLIB.BAS - auxiliary BASIC routines used by BED.BAS, separately
compiled
BED.LIB - compiled assembler routines used by BED
REPORT.DAT - a sample report. We want to strip the report down to
pure data, omit commas from numbers, and rearrange
dates to year-month-day format with no separator
between the date elements.
REPORT.SPC - a sample editing specification that works on
REPORT.DAT
REPORT.BAD - A list of phrases whose presence should exclude a
line from the output. Needed by REPORT.SPC.
Acknowledgement - this program uses some assembler routines
distributed by Tom Hanlin in ADVBAS.LIB. ADVBAS is an excellent
shareware product that includes many useful assembler routines.
7 December 1985
* * * * * * * * * * * * * * * * *
TABLE OF CONTENTS
What is BED?
What Advantages does BED have over other Editors?
How Can BED be Used?
Exactly What Edits can BED do?
How to Invoke BED
Explanation of Each Editing Specification
The Order of Edits
How to Recompile BED
* * * * * * * * * * * * * * * * *
What is BED?
BED is a batch editor. EDITOR - because it modifies lines of text in
a file. BATCH - because it runs batch rather than interactively. The
only thing you specify interactively is what edits you want done.
What Advantages does BED have over other Editors?
o Runs batch.
BED is designed to run entirely unattended after you set up a
configuration file telling BED what to do. This makes it ideal for
production work where files are edited repeatedly the same way.
o Speed.
BED runs very fast. "Interactive" editors often include a macro
facility whereby they can run batch also. But, for each edit, they
usually page in part of the file that can be held in RAM, and do the
edit. Each edit makes a full pass through the data file. BED makes
exactly one pass through the data file, no matter how many edits are
done. BED is also much more efficient at doing global search and
replaces than most editors.
o Does complex reformatting simply.
BED is designed to make some complex edits very easy, such as
reformatting date fields (e.g. removing separators and rearranging
fields, such as MM/DD/YY to YYMMDD) and changing formatted numbers
(dollar signs, parentheses around negative numbers, commas separating
thousands) to a "pure" data format (e.g. "($89,655.21)" to
"-89655.21"). BED also will preserve the original field length when
doing such edits by inserting filler blanks. These edits are either
very difficult or impossible using other editors.
o No theoretical or practical limit on file size.
Most editors either will not load large files or take forever to load
them before editing can begin, so that the time it takes to edit files
grows geometrically with file size. BED works just as efficiently on
large files as on small.
o Maximum line length of 32,676.
Most editors have a maximum line size of 256 characters or less. BED
will edit individual lines up to 32,676.
o Supports criteria for excluding entire lines.
Few editors allow you to select what lines to delete from a file. BED
allows entire lines to be omitted based on length or the presence of
keywords. Line exclusion criteria are applied before any other edits
are done.
o BED is freeform.
This means that the text you edit does not have to be in any special
location or column. BED will find the text to be edited no matter
where it occurs in the file or line. Data base editors that allow you
to reformat data always require that the data be broken into fields
which are either fixed in length and/or order. BED does not impose a
structure on the data. You do not have to divide your file into fields
and records, nor do you tell BED where in the file the data is
located.
o Source code is provided.
You can fix bugs in BED or enhance it to do new tasks. Commercial
editors never give you the source code.
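As an illustration only, the one-pass design described under "Speed"
above can be pictured with the following Python sketch. This is not
BED's source code (BED is written in BASIC). The name run_edits and the
idea of passing each edit as a function are assumptions made for the
example.

```python
import io


def run_edits(source, target, edits):
    """Stream source to target, applying every edit to each line once."""
    for line in source:
        text = line.rstrip("\r\n")
        for edit in edits:          # all edits are applied before moving on
            text = edit(text)
        target.write(text + "\n")


# Two edits, but the input is still read only once.
src = io.StringIO("total $1,200\nship $300\n")
out = io.StringIO()
run_edits(src, out, [str.upper, lambda s: s.replace("$", " ")])
print(out.getvalue(), end="")
```

Because only one line is held in memory at a time, the cost of adding
another edit does not depend on the size of the file.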
How Can BED be Used?
BED's primary use is to
o prepare data for reading into other programs.
BED is basically a data preparation utility for "cleaning",
"scrubbing", or reformatting data.
Often data needs to be changed before it can be loaded into a data
base management system for analysis and reporting. A typical problem
is that data contains characters that make it easier for the human eye
to read, but which a data base management system can not accept,
including
o "non-numeric" characters in numeric fields, such as a dollar
sign, comma, or parentheses around negative numbers
o dates with separators between the month, day, and year, and the
fields in the wrong order.
It is always good data processing policy to never store data with
formatting in it, but only to add formatting in reports.
Unfortunately, this policy is sometimes violated, and sometimes the
only dump available of data is a report (spreadsheets are among the
worst offenders). The presence of "formatting" characters in data
fields usually means that other programs cannot read the data.
Reports also typically include page headers, titles, page numbers,
blank lines, and printer pagination commands (form feeds), all of
which need to be stripped out to leave pure data. BED is
expressly designed to eliminate such formatting. For example, data
managers usually store dates in YYMMDD format so that they will sort
properly, and they will usually batch load dates only in that format.
Yet the most common format for dates in the United States is MM-DD-YY.
BED allows you to strip off the dashes and rearrange the elements of
the date.
Exactly What Edits can BED do?
o Exclude entire lines
-that are shorter than a minimum length
-that are longer than a maximum length
-that contain any of a specified list of strings
Typical use: you must load data that came from a report and which has
blank lines in it and page headers. You must strip out these lines to
leave the pure data. You tell BED to exclude empty lines (shorter
than 1) and lines with "PAGE" in them. Form feeds, used by most
printers to cause a page eject, are automatically stripped out when
using the delete short line option.
o Global search and replace
-convert letters to upper case (all a-z to A-Z)
-convert any string to another string, including a target
string that is empty (omit string)
-delete all occurrences of specified characters
-character translation (replace specified characters by
specified characters)
Typical use: to decode or encode. For example, you want to convert
numeric codes for a geographic region to short words. Or, you need to
replace dollar sign ($) by a blank, as in "$500.23" to " 500.23".
o Fill lines shorter than a minimum with any character
Typical use: to convert a file with missing data on the end of a line
to a file with fixed length (e.g. LOTUS "PRN" files that have missing
columns on the end do not fill with blanks).
o Convert numbers enclosed in parentheses to negative numbers
Typical use: to make financial reports machine readable, where
numbers like (9,750) will be rejected as non-numeric.
o Omit commas inside numbers
Typical use: to make numbers machine readable. For example,
4,200,500 needs to be reexpressed as 4200500.
o Change formats of date fields
-Remove separators between month, day, and/or year parts of
a date field
-Rearrange month, day, year fields in dates
-Convert a spelled month in dates to numeric format
-Convert year field in dates between 4 and 2 digit format
Typical use: to make dates machine readable. For example, LOTUS will
output a date as "Oct-85" but you need 8510. Or 07/20/85 must be
rearranged to 850720. Note: BED always takes spelled months to be
the first three characters of the English spelling for months,
ignoring case. The year element can be 4 or 2 digits. A numeric day
or month element is assumed to have a length of two, so that in
" 7/28/65" the leading blank is assumed to be the beginning of the date
field.
How to Invoke BED
BED is invoked at DOS by typing
BED/[options] [file spec] ...
where the options are
B for running batch. You will not be asked for
any input. BED will run completely unattended and return to
DOS.
F for file of inputs. The [file spec] then names a file that
lists the specification files to be run consecutively.
The file specification has the format
[drive letter]:\[path]\[file name]
BED does not support wildcards. The [file spec] is NOT the name of
the file to be edited, unlike virtually all other editors. Except
when the /F option is specified, it is the file of specifications
telling BED what to do. These specification files are written out by
BED, based on previous full screen keyboard entry. Some examples:
BED No parameters. BED then asks you if you want to
Edit another file or Quit to DOS. If edit, will
ask you for file of specifications.
BED TEST.SPC Read the saved editing specifications in file
TEST.SPC, display them on the screen for possible
changes.
BED/B TEST.SPC Read the saved specifications in TEST.SPC, display
on the screen, then run them automatically.
BED/B TEST1.SPC TEST2.SPC C:\DB\TEST3.SPC
Batch run according to specifications in
TEST1.SPC, then TEST2.SPC, then C:\DB\TEST3.SPC.
You can give up to 10 file specifications in the
command line.
BED/F ALLRUNS Read the list of specifications for editing file
ALLRUNS, which is an unlimited list of file
specifications rather than a single file
specification. Each file name in ALLRUNS is on a
separate line.
BED/B/F ALLRUNS Same as above, except run all specifications
automatically rather than asking whether you want to
run or edit.
Explanation of Each Editing Specification
(1) READ FR. Name of file to be read. The input file to be edited.
FR consists of variable length lines. Each line is
terminated by a carriage-return-line-feed. A line can have up to
32,676 characters in it. There is no limit on the size of this
file.
(2) WRITE FW. Name of file to be written. Cannot be same as input
file. Each line of input file is read, edited, and then written
to FW (unless excluded). Format of output is same as input.
(3) SAVE SPECS IN FS. Name of file the editing specifications are
saved in.
A save is executed prior to any run, except when run batch.
Specs must be saved before you can run batch.
(4) EXCLUDE LINES with a length less than XX. When a line is read in
that has a length less than XX, do not write it out. Exclude it.
Omit it.
Very useful when you want to be able to check whether the proper or
expected lines were excluded. The exclude option can also be used to
split files into 2 subsets.
(5) EXCLUDE LINES with a word in file FX. BED first reads in all the
lines in file FX. If any line in file FX is contained in a line
in FR, that line of FR will be omitted. Each input line is checked
to see if it contains any line in FX that would trigger an exclusion.
(6) EXCLUDE LINES with a length greater than XX. When a line is read
in that has more than XX characters, do not bother editing it
further. Just omit it.
(7) SAVE LINES IN FILE FEXC. If any lines are excluded, write them
out to file FEXC.
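The exclusion options (4) through (7) can be pictured with a short
Python sketch. It is an illustration, not BED's code: the function
name split_lines and its parameters are made up for the example, and
skipping empty phrases is an assumption (an empty phrase would
otherwise match every line).

```python
def split_lines(lines, min_len=0, max_len=None, phrases=()):
    """Return (kept, excluded) using the three exclusion tests."""
    phrases = [p for p in phrases if p]      # ignore empty phrases
    kept, excluded = [], []
    for line in lines:
        too_short = len(line) < min_len
        too_long = max_len is not None and len(line) > max_len
        has_phrase = any(p in line for p in phrases)
        if too_short or too_long or has_phrase:
            excluded.append(line)            # would be written to FEXC
        else:
            kept.append(line)                # goes on to the other edits
    return kept, excluded


report = ["PAGE 1", "", "WIDGET 1,200", "GADGET 300"]
kept, excluded = split_lines(report, min_len=1, phrases=["PAGE"])
print(kept)
print(excluded)
```

Keeping the excluded lines in their own list mirrors option (7) and
makes it easy to check that only the intended lines were dropped.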
(8) CONVERT TO UPPER CASE. Every line written out is to be converted
to upper case first.
(9) GLOBAL SEARCH AND REPLACE IN FSR. File FSR contains pairs of
strings, where all occurrences of the first string are to be
replaced by the second.
The two strings in each pair are separated by a comma. Quotes
around a string are necessary if it contains a comma. An example of
the contents of an FSR would be
"JACKSON","JONES"
comma,semi-colon
"help,","help:"
(10) DELETE THESE CHARACTERS. All occurrences of the listed
characters are to be eliminated from all lines.
CAUTION: eliminating characters shortens lines. Since the
number of occurrences can vary, lines that were originally the
same length may become variable length.
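Specifications (9) and (10) can be pictured with the Python sketch
below. It is an illustration under stated assumptions, not BED's code:
the function names are hypothetical, Python's csv module stands in for
BED's own parsing of the quoted pairs, and applying the pairs one
after another in file order is an assumption.

```python
import csv


def load_pairs(spec_lines):
    """Parse 'old,new' lines; quotes protect embedded commas."""
    return [(row[0], row[1]) for row in csv.reader(spec_lines)
            if len(row) >= 2 and row[0]]


def replace_all(line, pairs):
    """Apply each search-and-replace pair to the line in turn."""
    for old, new in pairs:
        line = line.replace(old, new)
    return line


def delete_chars(line, chars):
    """Remove every occurrence of each listed character."""
    return line.translate(str.maketrans("", "", chars))


pairs = load_pairs(['"JACKSON","JONES"', 'comma,semi-colon', '"help,","help:"'])
print(replace_all("JACKSON said help, use a comma", pairs))
print(delete_chars("$1,200", "$,"))
```

Note how deleting characters shortens "$1,200" from six characters to
four, which is the effect the CAUTION above warns about.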
(11) TRANSLATE FROM. List of characters to be replaced.
(12) TRANSLATE TO. List of replacement characters. Must be same
length as list translate from. Each character translate from is
replaced by corresponding character translate to.
If $a=d is what is translated from and :A)g is what is
translated to, then "abcdABCD$2+=dD" becomes "AbcgABCD:2+)gD".
Note that translate never changes the length of a
string because it is only a 1 for 1 substitution.
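Character translation as described in (11) and (12) corresponds
closely to Python's built-in str.translate. The sketch below is only
an illustration; the function name translate is made up for the
example.

```python
def translate(line, chars_from, chars_to):
    """Replace each character in chars_from by its partner in chars_to."""
    if len(chars_from) != len(chars_to):
        raise ValueError("translate lists must have the same length")
    return line.translate(str.maketrans(chars_from, chars_to))


# $ -> :   a -> A   = -> )   d -> g   (everything else is unchanged)
result = translate("dad=$5", "$a=d", ":A)g")
print(result)
```

The output always has the same length as the input, since every
character is replaced by exactly one character.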
(13) FIX LINE LENGTH, pad lines shorter than XX. When an output line
would have fewer characters than XX, extend it with enough blanks
to fill it out to XX. Fill to the right. Leave lines with XX or
more characters alone.
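In Python terms, the padding rule in (13) is a left-justify to a
minimum width. The function name pad_line is hypothetical and the
sketch is only an illustration of the rule.

```python
def pad_line(line, width, fill=" "):
    """Extend short lines on the right; leave longer lines alone."""
    return line.ljust(width, fill)


print(repr(pad_line("AB", 5)))
print(repr(pad_line("ABCDEF", 5)))
```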
(14) EDIT NUMBERS, convert parentheses to minus sign. Whenever BED
finds a number enclosed in parentheses, it puts a negative sign in
front of the number and writes the result out as a replacement. It
will also squeeze out trailing and leading blanks around the number,
and will preserve the original length by blank filling to the right
as necessary.
E.g.
(88.5)10/24/85 ( 9.1213 ) becomes -88.5 10/24/85 -9.1213 )
(-12.14 )xxxxx(aa) becomes --12.14 xxxxx(aa)
(12.a)(14.2 becomes (12.a)(14.2
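The behavior described in (14) can be approximated with a regular
expression. The Python sketch below is an assumption about how such an
edit could be written, not BED's actual routine; the function name and
the exact pattern are made up for the example.

```python
import re

# "(" + optional blanks + a number (optionally signed) + blanks + ")"
PAREN_NUMBER = re.compile(r"\(\s*([-+]?\d[\d,]*(?:\.\d*)?)\s*\)")


def parens_to_minus(line):
    """Rewrite (number) as -number, blank filled to the original width."""
    def fix(match):
        return ("-" + match.group(1)).ljust(len(match.group(0)))
    return PAREN_NUMBER.sub(fix, line)


print(repr(parens_to_minus("(88.5)10/24/85 ( 9.1213 )")))
print(repr(parens_to_minus("(12.a)(14.2")))
```

Text such as "(12.a)" or an unclosed "(14.2" does not match the pattern
and is left untouched, and every replaced field keeps its original
width.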
(15) EDIT NUMBERS, omit commas. Commas inside of numbers are omitted.
Length of numeric field is preserved by filling to left or right
by blanks.
CAUTION: there is no infallible routine for recognizing a
numeric field properly. BED looks for a comma. If a numeric
digit is immediately to the left and three digits to the right,
BED assumes this is a numeric field.
The hardest thing for BED to do is to properly recognize
where a numeric field begins and ends. To get the beginning of a
numeric field, BED looks a maximum of two more characters to the
left of the first numeric digit to the left of the comma. If
these are numbers, BED assumes they belong to the numeric field.
If BED encounters a non-numeric character, it assumes the numeric
field ends there unless the character is a plus or minus sign.
To find the end of the number, BED keeps looking 4
characters to the right of the last known comma in a numeric
field. If it is a comma, BED checks the three characters to the
right and assumes the numeric field continues if these are all
numeric digits; any non-numeric means the number ends. If the
4th character ahead is a decimal, BED assumes it is a part of the
number and then looks for the decimal part and continues until a
non-numeric character is encountered or a user specified maximum
number of digits is reached.
(16) EDIT NUMBERS, omit commas, right delimited? Yes means that each
numeric field terminates to the right with a non-numeric
character (or is at the end of the line).
This would happen, for example, if each numeric field has at
most 5 digits, but was written out left-justified in a field 6
characters long. Then a blank to the right would terminate each
numeric field.
This information is used by BED to decide whether to fill a
numeric field with blanks to the left or right. Telling BED
there is a non-numeric to the right of each numeric field assures
BED that it will properly find the end of each number, so it
shifts numbers with commas to the left and fills to the right
with blanks.
(17) EDIT NUMBERS, omit commas, maximum number of decimals. Numbers
may occur with or without fractional parts. Decimal numbers have
digits after a decimal point. Tells BED the maximum number of
digits in a decimal part.
This information is used by BED to determine where the end
of a numeric field is and whether to fill to the left or right.
If there are no decimals, BED shifts numbers to the left and
right fills. If there are decimals, BED uses this as a maximum
number of digits to include in the decimal part. If there are
decimals and numbers are not right delimited, BED shifts right
and fills to the left, hoping that numbers are right justified in
the data.
CAUTION: When strings of numbers occur together, BED may not properly
break these numbers into proper fields. For example, the number
"811,132" may represent an inventory of 81 and a value of 1,132 or a
single number 811,132. Or "2,811.5025" may mean a value of 2,811.50
with an inventory of 25, or the number 2,811.5025. BED has no
foolproof way of knowing where a number begins or ends. BED relies on
context to break numbers into fields. BED will work properly if each
numeric field has a non-numeric character beginning and ending it. If
a non-numeric character always terminates the field, BED will work
properly when you tell it this, because it will shift numbers to the
left. CHECK WHETHER YOUR DATA HAS ADJACENT NUMERIC FIELDS THAT RUN
TOGETHER. Also, BEWARE COMMAS in non-numeric fields that can be
surrounded by numeric digits. The only kinds of cases BED can
misinterpret when deleting commas from numbers are:
(a) non-numeric fields that contain a comma but look like a numeric
field when adjacent to non-numeric fields. For example, if a
field has the value "3,7" followed by "12C" and these occur together as
"3,712C", BED will misinterpret this as the number "3,712"
followed by "C" and hence produce either "3712 C" or " 3712C",
neither of which is right.
(b) no right delimiter, possible decimals, and an integer field
immediately in front of a second numeric field. For example, if
a two digit field precedes a four digit field, BED will
misinterpret "815,001" to be " 815001" rather than the proper "81
5001".
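The comma-removal rules in (15) through (17), including the choice
between left and right fill, can be approximated as follows. This
Python sketch is an illustration only: it uses a regular expression
rather than BED's character-by-character scan, and the function and
parameter names are assumptions.

```python
import re


def omit_commas(line, right_delimited=True, max_decimals=0):
    """Drop commas inside numbers while preserving the field width."""
    # One to three digits, then one or more ",ddd" groups.
    pattern = r"[-+]?\d{1,3}(?:,\d{3})+"
    if max_decimals > 0:
        pattern += r"(?:\.\d{1,%d})?" % max_decimals
    # Shift left (fill right) when right delimited or no decimals occur.
    fill_right = right_delimited or max_decimals == 0

    def fix(match):
        field = match.group(0)
        digits = field.replace(",", "")
        if fill_right:
            return digits.ljust(len(field))
        return digits.rjust(len(field))

    return re.sub(pattern, fix, line)


print(repr(omit_commas("4,200,500")))
print(repr(omit_commas("815,001", right_delimited=False, max_decimals=2)))
print(repr(omit_commas("3,712C")))
```

The last two calls reproduce the misreadings described in cases (a)
and (b) above: the sketch, like BED, cannot tell that "815,001" or
"3,712C" were really two adjacent fields.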
(18) EDIT DATES, number of digits in input year. How many digits are
in the year in date fields in the data. BED assumes that all
dates have the same number of digits in the year. Normally, this
will be 2 or 4, depending on whether the century is included.
(19) EDIT DATES, number of digits in output year. How many digits are
to be written out in dates with a year component. If 4 are read
in and 2 written out, last two will be used. If 2 are read in
and 4 written out, a "19" will be put in front.
(20) EDIT DATES, separator between day, month, and year fields in a
date. A single character, normally slash (/) or dash (-). Used
to identify potential strings as dates. BED examines the characters
adjacent to the separator to see whether they have date format.
(21) EDIT DATES, edit date with spelled month. Assumes spelled month
consists of first three letters of English spelling for months
(e.g. January is Jan). Removes separator and uses numeric order
for month, e.g. Mar-85 is 0385.
(23) EDIT DATES, spelled month, input date format. Where you specify
what elements (month, day, year) are present and in what order
they occur (e.g. day then month then year is DMY).
(24) EDIT DATES, spelled month, output date format. The order in
which you want the elements of the date field to be written out.
Allows you to rearrange from input order.
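The spelled-month conversion can be pictured with the Python sketch
below, which handles only the month-then-year case from (21) and keeps
the input order. It is an illustration, not BED's routine; the function
name and the two-digit year are assumptions made for the example.

```python
import re

MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")


def spelled_month_to_number(line, sep="-"):
    """Rewrite 'Mon<sep>YY' as 'MMYY', ignoring the case of the month."""
    pattern = r"\b([A-Za-z]{3})" + re.escape(sep) + r"(\d{2})(?!\d)"

    def fix(match):
        name = match.group(1).lower()
        if name not in MONTHS:
            return match.group(0)        # not a month: leave text alone
        return "%02d%s" % (MONTHS.index(name) + 1, match.group(2))

    return re.sub(pattern, fix, line)


print(spelled_month_to_number("Mar-85"))
print(spelled_month_to_number("due OCT-85, ref abc-85"))
```

Three letters that are not the start of an English month name, such as
"abc", are passed through unchanged.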
(25) EDIT DATES, edit numeric dates. Each date field is assumed to be
a 6 digit number with a field separator. BED assumes you want to
drop the separator and possibly change the field order.
(26) EDIT DATES, edit numeric dates, input date format. Any two or
three letters of "DMY", telling what fields are present in what
order. E.g. "YM" means there is only a year followed by a month,
with no day.
(27) EDIT DATES, edit numeric dates, output date format. Same
possibilities as above, only used for output. BED takes each
field in the input and places it where it occurs in the output.
E.g. to rearrange "12-05-82" to "821205", specify "MDY" for input
and "YMD" for output.
CAUTION: Beware fields where a date separator may occur in non-date
fields and be surrounded by numbers. For example, if "12-10" really
means the expression 12 minus 10, it could be misinterpreted as the 10th
of December.
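Numeric date rearrangement, together with the year-width rules in (18)
and (19), can be pictured with the Python sketch below. It is an
illustration under stated assumptions, not BED's code: the function
and parameter names are hypothetical, a leading blank in a day or
month is turned into a zero, day and month ranges are not checked, and
the output format may only use letters present in the input format.

```python
import re


def edit_numeric_dates(line, in_fmt="MDY", out_fmt="YMD", sep="-",
                       in_year=2, out_year=2):
    """Drop the separator and reorder the date fields."""
    part = {"D": r"(?P<D>[ \d]\d)",
            "M": r"(?P<M>[ \d]\d)",
            "Y": r"(?P<Y>\d{%d})" % in_year}
    pattern = (r"(?<!\d)"
               + re.escape(sep).join(part[c] for c in in_fmt)
               + r"(?!\d)")

    def fix(match):
        fields = {k: v.replace(" ", "0")
                  for k, v in match.groupdict().items()}
        if "Y" in fields:
            if in_year == 4 and out_year == 2:
                fields["Y"] = fields["Y"][-2:]      # keep last two digits
            elif in_year == 2 and out_year == 4:
                fields["Y"] = "19" + fields["Y"]    # put "19" in front
        return "".join(fields[c] for c in out_fmt)

    return re.sub(pattern, fix, line)


print(edit_numeric_dates("12-05-82"))
print(edit_numeric_dates("07/20/85", sep="/", out_year=4))
print(edit_numeric_dates("if 12-10", in_fmt="MD", out_fmt="MD"))
```

The last call shows the hazard from the CAUTION above: the sketch
cannot tell a subtraction from a month-day pair, so "12-10" loses its
separator.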
NOTE: Auxiliary files that can be used by BED, such as a list of
phrases that will cause a line to be excluded, or a list of strings to
search and replace, have to be created outside of BED using an editor.
The Order of Edits
The order that edits are applied can make a difference. For example,
if you substitute '*' for '$' and then remove all lines with '$',
nothing will be removed. Reversing the order, lines with $ will be
removed and then there will be nothing left to substitute. In BED
there are eight editing functions: the order in which they occur is
delete characters
convert to upper case
global search and replace
translate characters
omit commas from numbers
convert numbers in parentheses to negative
replace spelled dates
replace numeric dates
This is the default order that will be used if nothing else is
specified. However, in BED you can completely control the order that
these edits are performed. Where you would normally put "Y" to invoke
the function, simply put in "1" for the 1st to be performed, "2" for
the 2nd, etc. Any options selected with a "Y" will be invoked in
their default order after the numbered options are selected.
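The ordering rule just described (numbered options first, then "Y"
options in default order) can be pictured with the Python sketch
below. The settings dictionary and the function name edit_order are
assumptions made for the example; BED itself takes these values from
its specification screen.

```python
DEFAULT_ORDER = [
    "delete characters",
    "convert to upper case",
    "global search and replace",
    "translate characters",
    "omit commas from numbers",
    "convert numbers in parentheses to negative",
    "replace spelled dates",
    "replace numeric dates",
]


def edit_order(settings):
    """settings maps an edit name to 'Y', 'N', or a number such as '2'."""
    numbered = sorted((int(value), name)
                      for name, value in settings.items()
                      if value.isdigit())
    flagged = [name for name in DEFAULT_ORDER
               if settings.get(name, "N").upper() == "Y"]
    return [name for _, name in numbered] + flagged


order = edit_order({
    "global search and replace": "1",
    "convert to upper case": "2",
    "delete characters": "Y",
    "translate characters": "N",
})
print(order)
```

Here the two numbered edits run first in the order given, and the
edit marked "Y" follows them.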
How to Recompile BED
BED was written in Microsoft QUICK BASIC, with some assembler routines
for speed.
First, compile BEDLIB.BAS and BED.BAS. The command to do this is
bascom bedlib /w/o,bedlib,nul.lst
bascom bed /w/o,bed,nul.lst
Then link all the programs together:
link BED+BEDLIB,BED,nul.map,BED.LIB