Text File | 1993-04-18 | 40.0 KB | 1,038 lines

UNPOST

Name:

unpost - Extract binary files from multi-segment uuencoded USENET postings or Email.

Synopsis:

    unpost [-b[-]] [-c <configuration file>] [-d[-]] [-e <error file>] [-f[-]] [-h|-s|-u] [-i <incompletes file>] [-t <text file>] <source file>

Where everything but the source file is optional.

Description:

UNPOST is a tool designed primarily to extract binaries from USENET binaries postings such as those made to alt.binaries.pictures.misc and comp.binaries.ibm.pc. UNPOST can also extract binaries from multi-segment uuencoded mailings; however, to simplify this documentation, only USENET article postings will be discussed. The principles are the same for multi-segment mailings.

To avoid confusion, this documentation will refer to a single letter OR article as a 'segment'. For clarification on what a segment means to UNPOST, see Theory of Operations.

Features:

1) PORTABILITY!

UNPOST has been compiled and successfully run on MS-DOS, OS/2, Windows, Unix workstations, Macintoshes, Amigas and VAX/VMS systems. The code is written to be pure ANSI C, within reasonable limits. Some ANSI C capabilities are not used where they would be appropriate, due to lagging compliance in most compilers. (Hey, Unix types! Borland C++ 3.1 on MS-DOS is a MUCH better compiler than anything I've seen on a Unix workstation! And their debugger is the best I've used, as well.)

Unfortunately, there are still a lot of Unix boxes that have only a K&R compiler, so it may not port well to those. I personally check to make sure that it will compile and run on an MS-DOS box running MS-DOS 5 and Windows 3.1, using the Borland 3.1 C++ compiler, as well as on a Sun (running SunOS 4.1.1 sun4c) using the gcc compiler (version 2.1). I know for a fact that the Sun cc compiler will NOT compile UNPOST successfully. K&R compatibility is being considered, but it is a low priority feature.

2) CONFIGURABILITY!
UNPOST comes with a default set of rules for detecting and parsing a VERY wide range of possible Subject: line formats, but no configuration can be correct for every situation. With that in mind, UNPOST can be configured by the user by creating a text file that contains the regular expressions that UNPOST uses to recognize and parse postings.

WARNING! UNPOST depends almost ENTIRELY on the contents of its configuration file for correct operation. Regular expressions are complex, and writing one that works the way you expect it to takes care and, most importantly, experimentation. To this end, the standard UNPOST installation creates both the UNPOST executable and a regular expression test program called RETEST. RETEST is like grep: feed it a regular expression and a file, and RETEST will tell you what it matched and the sub strings that it extracted.

3) INTELLIGENCE!

UNPOST uses every trick in the book to try to guess what the poster/sender REALLY meant. Also, UNPOST is not limited to finding all of its information on a single line, or even in the header of a posting/letter. UNPOST has successfully extracted binaries from postings that had, as a subject line,

    Subject: aaaa

because UNPOST recognized the signature placed into the body of the article by a uuencode/split program.

4) FLEXIBILITY!

UNPOST has switches that allow it to be configured to do different things for different tastes. For instance, UNPOST will intelligently sort out articles into four different classes:

    1) Articles that are part of a complete and correct binary in the input file. These are sorted, concatenated, uudecoded and written out to a file name that is the same as that on the uuencode begin line. Depending on the setting of the file name switch, the file name of the binary may be modified. See below.

    2) Articles that are pure text (no uuencoded data in them). If the -t switch and a file name are specified, these articles will be written out to the file for reading.
    Obviously, these articles should NEVER be encountered in a binaries news group, but not a single day has ever gone by that I did not see non-binary postings to binary news groups.

    3) Articles that are part of incomplete postings (four parts, but only three have shown up so far), or that comprise a complete binary, but one that had an error in uudecoding, interpretation, etc. If the -i flag and a file name are specified, these articles will be written out to the file. If the -b switch is on, incompletes will be written to separate files. If both are on, those incompletes for which a file name can be guessed will be written to separate files, and all others will be written to the file named by the -i switch.

    In my experience, two types of articles end up in an incompletes file: those that have missing parts, and those that have been misinterpreted by UNPOST as belonging to a different binary than they really do.

    4) Articles that are pure text that describe a posting (these are usually found only in the pictures groups). If the -d flag is set, and the binary to which they belong is correct and complete, this article, as well as the header and body up to the uuencode begin line of the first article, will be written to a file that has the same base name as the binary, but with the extension .inf.

UNPOST automatically mungles binary file names to be MS-DOS compatible (the lowest common denominator). This is switch controllable, and can be turned on or off (the default setting is selected by the person who compiled UNPOST).

UNPOST also has two lesser modes, sorted mode and uudecode mode.

In sorted mode, UNPOST assumes that the articles still have headers, and that there may be un-uuencoded lines in the middle of a uuencoded file that have to be filtered out, but it assumes that all parts are present, and that they are in order. Header information, however, is ignored.
If you use the incompletes file capability of UNPOST, you will notice that it writes out the segments that it did interpret correctly in sorted order.

In uudecode mode, UNPOST acts like a simple uudecoder. UUencoded files must be complete, with a begin and end line, and no un-uuencoded lines can appear between the begin and end lines. However, uudecode mode is the ONLY mode where UNPOST will accept a short line (one that was space terminated, but had the spaces chopped off) as a legal uuencoded line and properly decode it.

5) INFORMATIVE!

UNPOST is a very talkative program. It detects and reports many kinds of problems, tells you what it thinks is going on, and tells you what it is doing. All this information is written to standard error, or if the -e switch and a file name are specified, written to that file.

Theory of Operations:

UNPOST assumes that the source file that is given to it will have the following format:

    SEGMENT begin line
    ...
    HEADER ID line
    ...
    BODY ID line
    ...
    UUENCODED line

Where the lines are:

    SEGMENT begin line - The line that identifies the beginning of a segment.

    HEADER ID line - One or more lines that contain segment number, total number of segments or the ID string in the article or mail header.

    BODY ID line - One or more lines that contain segment number, total number of segments or the ID string in the article or mail message body.

    UUENCODED line - The first uuencoded line in the file. UUencoded lines include the begin and end lines.

    ... - Indicates zero or more lines that can contain any information so long as they CANNOT be misidentified as SEGMENT begin, ID or UUENCODED lines.

Notice that the ID information can be spread across multiple lines.

A segment is assumed to end at the beginning of the next segment, or at the end of the source file.

An UNPOST source file contains one or more segments. UNPOST has three different modes: interpretation mode, concatenation mode (the sorted mode described above) and UU decoder mode.
In all three modes, UNPOST can accept one or more input files.

In the first mode, interpretation mode, UNPOST looks at segment header and body lines before the first UU encoded line, and attempts to extract three pieces of information from them: segment number, total number of segments that the binary was split into, and an ID string that is common to all segments. If UNPOST finds something that it considers to be an ID string, and a uuencoded line in the segment, but it does not find a segment number and number of segments, UNPOST assumes that the segment is a single segment binary posting (part 1 of 1).

To aid in finding out what happened, in interpretation mode UNPOST will write a list of all the different ID strings and their respective segment lists to standard error or the file specified as the error file (see the Standards section for details of what an ID string is). Any errors or warnings detected during processing will also be written to standard error or the error file.

In interpretation mode three other files can optionally be created. All three of these files will contain segments copied out of the source file, and none of these files will be created unless they are turned on and named by a command line switch.

The first optional file that UNPOST can create for the user in interpretation mode is the text file (-t switch). This file will have copied to it all segments from the source file that do not contain uuencoded data.

Segments numbered 0/# that do not contain uuencoded data will NOT be copied to the text file. They are considered to be description segments, and they will be copied to the second optional file, the description file, only if the -d switch is turned on. Also, all binary postings that have all of their segments present will have the segment header and body of segment #1 (up to and including the uuencode begin line) copied into the description file.
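The three items that interpretation mode extracts, and the single-segment fallback, can be sketched in Python as follows. This is only an illustration, not UNPOST's code: the function name `interpret` and both patterns are hypothetical, the patterns are loosely adapted from the sample configuration later in this document, and Python's `re` dialect stands in for UNPOST's own regular expression syntax.

```python
import re

# Hypothetical patterns; Python regex syntax, not UNPOST's dialect.
PART_RE = re.compile(
    r"^Subject:(.*?)\(?([0-9]+)[ ]*(?:of|/)[ ]*([0-9]+)\)?", re.IGNORECASE)
PLAIN_RE = re.compile(r"^Subject:[ ]*(.*)$", re.IGNORECASE)


def interpret(id_line, has_uu_data):
    """Return (id_string, segment_number, total_segments), or None."""
    m = PART_RE.match(id_line)
    if m:
        return m.group(1).strip(), int(m.group(2)), int(m.group(3))
    m = PLAIN_RE.match(id_line)
    if m and m.group(1).strip() and has_uu_data:
        # An ID string and uuencoded data, but no part numbers:
        # treat the segment as a single segment binary (part 1 of 1).
        return m.group(1).strip(), 1, 1
    return None


print(interpret("Subject: ship.gif (1/3)", True))
print(interpret("Subject: barber.gif", True))
```

A segment for which `interpret` returns None and that holds no uuencoded data would be a candidate for the text file described above.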
The third optional file that can be created in interpretation mode is the incomplete or unused uuencode data segments file. This file contains all segments that have uuencoded data, that were not used in a successful uudecoding. This file will only be created if the -i switch is present.

The incompletes file allows the user to hand decode those binaries which could not be interpreted or decoded by UNPOST. Often, a binary will have all of its parts, but UNPOST will not be able to put them together because of differences in the ID string between segments, or problems with the part numbering information. The simplest way to solve these problems is to collect the incompletes, edit the ID lines to correct the problem, and rerun UNPOST on the incompletes file.

In the second mode, concatenation mode, UNPOST assumes that all of the segments in the source file between a uuencode begin and a uuencode end line are part of one binary posting and that the segments are in order. UNPOST scans from the beginning of the file until it finds a uuencode begin line, and decodes from there (skipping over non-uuencoded lines such as segment header lines and signatures) until it finds a uuencode end line.

In the last mode, UU decoder mode, UNPOST assumes that the source file contains one or more UU encoded files. Only UU encoded lines are allowed between the uuencode begin line and the uuencode end line of any single uuencoded file.

Example header:

    (1) Article 2096 of alt.binaries.pictures.misc:
        Newsgroups: alt.binaries.pictures.misc
        Path: csn!csn!convex!cs.utexas.edu!
        From: a43xz@brain.ac.da (Joe User)
    (2) Subject: ship.gif (1/3)
        Organization: Somewhere Near The Sea.
        Date: Fri, 19 Feb 1993 06:43:48 GMT
        Message-ID: <21128@brain.ac.da>
        Sender: news@dep.rnsft.ac.da (Usenet)
        Lines: 761

        Picture of a ship in a bottle, full rigged. How did it get there?

    (3) section 1 of uuencode 5.20 of file ship.gif by R.E.M.
    (4) begin 644 ship.gif
        M1TE&.#=A@`+@`9<```0$!`0$!",G,"LG)RLG,"LG.30G,#TG)RLP,"LP.3TG

In the above example, line (1) is the SEGMENT begin line, line (2) is a HEADER ID line, line (3) is a BODY ID line and line (4) is the first UUENCODED line in the body.

Options:

-b[-]
    Set this flag to make UNPOST write the incomplete uuencoded segments to separate files. This defaults to off.

-c <file>
    Read and use a different configuration than the default configuration. The default configuration is stored in a file called def.cfg.

-d[-]
    Turns on description capturing and writes descriptions to a file that has the same name as the output but with a .inf extension. This defaults to on.

-e <file>
    Redirects error and information output from standard error to <file>.

-f[-]
    Modify file names to be MS-DOS/USENET compatible. Use of -f turns file name modification on if the default is off, and -f- turns file name modification off if the default is on. File name modification is currently the default.

-h
    Turns on full interpretation mode. This is the default.

-i <file>
    Turns on incomplete binaries capturing and writes the segments to file <file>.

-s
    Switch to ordered segment mode. This mode ignores segment headers, and assumes that the segments are in order.

-t <file>
    Turns on text only segment capturing and writes the segments to <file>.

-u
    Switch to uudecoder mode. Assume only uuencoded data between begin and end lines. Multiple uuencoded files are allowed.

-v
    Show version number and quit.

-?
    Show a summary of the command line switches.

It is important to realize that UNPOST parses the command line in parallel with operations, so the order of the switches on the command line is VERY important. For example:

    unpost -d -e errors -i abpm.inc abpm.uue -c cbip.cfg -d- cbip.uue

This will use the default configuration to process the file abpm.uue, writing out description files, writing errors to the file errors, and writing incompletes to the file abpm.inc.
After UNPOST finishes processing abpm.uue, it will read in the cbip.cfg configuration, turn off writing description files and process cbip.uue. Note that the errors will continue to be written to the file errors, and that the incomplete binaries will continue to be written to the file abpm.inc. Since we are switching configurations, this is probably not a good idea.

Standards:

In all modes, UNPOST recognizes and decodes only uuencoded data.

In interpretation mode UNPOST requires that:

    1) The uuencoded lines be true uuencoded lines. This means that if trailing spaces are truncated by a mailer, editor or news node, UNPOST will not consider those lines to be uuencoded lines. Also, the uuencode character set recognized by UNPOST is ' ' - '`', with no other characters being legal.

    2) All segments of the same binary file posting have the same, recognizable ID string.

    3) Segments have a recognizable SEGMENT begin line as the first line in the segment (denoting the beginning of a segment).

    4) All ID lines follow the SEGMENT begin line in the segment.

    5) The first UUencoded line of the segment follows the last ID line.

    6) The first uuencode line in the first segment be a begin line.

    7) The last segment contain a uuencode end line.

In sorted segment mode, UNPOST requires that:

    1) The uuencoded lines be true uuencoded lines. This means that if trailing spaces are truncated by a mailer, editor or news node, UNPOST will not consider those lines to be uuencoded lines. Also, the uuencode character set recognized by UNPOST is ' ' - '`', with no other characters being legal.

    2) The segments be stored in the file in order.

    3) The first uuencode line in the first segment be a begin line.

    4) The last segment contain a uuencode end line.

In uudecoder mode, UNPOST requires that:

    1) There be only uuencoded lines between a uuencode begin and a uuencode end line.
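As a rough illustration of what a "true uuencoded line" means, here is a Python sketch of a strict check and of the space-padding repair that only uudecoder mode is described as performing. It assumes the conventional uuencode layout (a leading length character followed by four characters per three bytes) and covers data lines only, not begin or end lines. The function names are hypothetical, and UNPOST's own checks may differ.

```python
UU_FIRST, UU_LAST = " ", "`"  # legal character range given in the text


def _in_charset(line):
    """True if every character lies in the ' ' - '`' range."""
    return all(UU_FIRST <= ch <= UU_LAST for ch in line)


def expected_length(line):
    """Length implied by the leading count character (assumed layout)."""
    nbytes = (ord(line[0]) - 32) & 0x3F
    return 1 + 4 * ((nbytes + 2) // 3)


def is_true_uu_line(line):
    """Strict check: legal characters and exactly the implied length."""
    return bool(line) and _in_charset(line) and len(line) == expected_length(line)


def repair_short_line(line):
    """Pad a line whose trailing spaces were chopped off; None if hopeless."""
    if not line or not _in_charset(line):
        return None
    need = expected_length(line)
    if len(line) > need:
        return None
    return line.ljust(need)


print(is_true_uu_line("#0V%T"))         # the three bytes "Cat"
print(is_true_uu_line("#00"))           # trailing spaces were stripped
print(repr(repair_short_line("#00")))   # padded back to full length
```

Under these assumptions, interpretation and sorted segment modes would correspond to the strict check alone, while uudecoder mode would also accept lines that the repair helper can restore.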
In this mode, UNPOST will recognize and attempt to repair lines that had trailing spaces truncated.

Examples:

To extract a single binary that had all of its segments saved in order to a single file:

    unpost -s binary.uue

To extract all binaries that have had all of their segments saved to a single file:

    unpost multiple.uue 2> errors

Or

    unpost -e errors multiple.uue

The file errors will contain a list of all the ID strings that UNPOST found and thought could have been binary files, and any errors that occurred during processing.

To capture the incomplete or unused segments that have uuencoded data in them:

    unpost -e errors -i multiple.inc multiple.uue

To capture descriptions and text only segments as well:

    unpost -d -e errors -t text -i multiple.inc multiple.uue

To process two different files, one in uudecoder mode, one in interpretation mode:

    unpost -e errors -u uuencode.uue -h multiple.uue

To process a file that requires a different configuration:

    unpost -c <configuration file> -e errors multiple.uue

Output:

UNPOST will write diagnostic and informative messages to either standard error or the error file. The error file has three parts. The first lists interpretation errors (duplicate segments, missing uuencode begin lines, missing ID string, segment number or number of segments, etc.). The second is a dump of the binaries found, the number of segments in each binary and the segment number and offset of each segment in the source file. The last part is a mixture of information (the name of the binary that UNPOST is attempting to decode) and any errors encountered during decoding.

In the example below, UNPOST found one segment that had uuencoded data, the Subject: line had barber.gif as the ID string, the binary has one segment, and in the list of segments below, we see that segment number 1 starts at offset 583 in the source file. If there is a missing segment, its segment number will be zero, and its file offset will be zero. There were no interpretation errors, and there were no decoding errors.
    File ID                         Segments
    ----------------------------------------
    barber.gif                             1
                                1        583

    Decoding Binary ID: 'barber.gif'

Notes:

To use this program to collect all of the binaries posted to, say, the alt.binaries.misc group on a daily basis, start up rn, go to the alt.binaries.misc newsgroup, and save all of the unread segments by using this command:

    .-$smisc.uue:j

This will save all segments from the current number to the last to the file misc.uue, then junk them.

After exiting rn, run UNPOST on the file misc.uue in interpretation mode (default mode):

    unpost -e errors -i misc.1 misc.uue

Make sure to check the errors and/or misc.1 file for segments that UNPOST couldn't extract.

Diagnostics:

Error - file 'filename' already exists.
    UNPOST will not overwrite an existing file. Delete the file or rename it and try again.

Error - missing begin line.
    UNPOST expected to find a uuencode begin line in this segment, but did not.

Error - missing file name.
    The binary that UNPOST was attempting to decode does not seem to have a uuencode begin line in the first segment, so UNPOST has no idea what the file name is.

Error - Could not open description file 'filename' for writing.
    UNPOST could not open a file of that name for some reason. Possibly a permission problem, or the file exists and is not writeable.

Error - Bad write to binary file.
    A file write failed for some unknown reason. Possibly a full disk?

Error - missing segment #
Binary ID: 'binaryID'
    In attempting to decode a file whose ID string is binaryID, one or more segments are missing.

Error - Missing UU end line.
    As this is the last segment, it ought to have a uuencode end line in it, but UNPOST did not find one.

Warning - Early uuencode end line.
    UNPOST found a uuencode end line, but this was not the last segment, so we found it early. Did the poster screw up and misnumber his segments?

Error - Unexpected UU begin line.
    We found an unexpected (read: this is not the first line of the first segment, so what is this doing here?)
    UU begin line.

Error - cannot identify string '' in line #
    In reading in a configuration file, the configuration file lexical analyzer could not recognize this string.

Error - Out of memory.
    Yup. Out of memory. Split the source file into smaller pieces and try again.

Error - Could not modify file name to be MS-DOS conformant.
    File name mungling is turned on, and the name of one of the files cannot be made conformant (probably due to having too many numbers in it).

Warning - Unexpected end of file in segment:
Segment: 'segment line'
    File name mungling is turned on, and UNPOST is attempting to identify the file type (so it can use the proper extension when modifying the file name) but the UU begin line was the last line in the file.

Warning - No UU line after begin.
Segment: 'segment line'
    File name mungling is turned on, and UNPOST is attempting to identify the file type (so it can use the proper extension when modifying the file name) but the UU begin line was not followed by a line of UU encoded binary data.

Error - Got number of segments but not segment number.
Error - Got segment number but not number of segments.
    UNPOST must have all three pieces of relevant data, but if UNPOST has at least an ID string, UNPOST will attempt to assume a one part binary.

Error - Could not get ID string.
    Fatal error. With no ID string, there is no way to collect the pieces together.

Error - No begin line in first segment:
Segment: 'segment line'
    UNPOST did not find a UU begin line in the first segment.

Error - missing '}' in regular expression.
    In a regular expression of the type abc{1, 2}, the closing curly brace is missing.

Error - To many sub-expressions.
    UNPOST has a limit on the number of sub-expressions it allows. This is a compile time option that can be changed by modifying the value of MAX_SUB_EXPRS in regexp.h.

Error - missing ')' in regular expression.
    Mismatched parentheses.

Error - badly formed regular expression.
Unexpected character 'c'
    I give up!
    What is this character doing at this point in a regular expression?

Error, can not enumerate a sub expression.
    Regular expressions of the type (...)* are not allowed.

Error - illegal regular expression node type.
    Whoops, we have an internal programmer's error here. Let me know if you see this.

Error - Sub expression # extraction failed.
    Another internal error that needs to be brought to my attention.

Error - could not open file 'filename' for reading.
    UNPOST could not open file 'filename' for processing. Did you spell it right?

Error - Unexpected end of file.

Error - Unexpected UU begin line.

Error - Segment number # greater than number of segments in:
Segment: 'segment line'
    Either UNPOST got screwed up somehow or the poster posted something like (Part 10/9).

Warning - duplicate segment # in:
Binary ID: 'binaryID'
    UNPOST found two segments with the same binary ID and the same segment number.

Error - reading source file.
    Could not read a line from the source file.

Error - Could not open file 'filename' for output.
    Could not open one of the text, incomplete or error files for writing.

Regular Expressions:

Operands
--------

UNPOST regular expressions have three types of operands: character strings (one or more characters), character sets and match any single character.

A character string is any series of adjacent characters that are not meta-characters (special characters). A character set is a string of characters enclosed in square brackets with an optional caret (^) as the first character following the open square bracket. The match any character operand matches any single character except the end of line character.

A character string in a regular expression matches the exact string in the source, including case.

Example of character strings:

    AirPlane - Matches the string 'AirPlane', but not the strings 'airPlane' or 'Airplane'.

A character set will match any single character in the source if that character is a member of the set.
If the first character of the set is the caret, the character set will match any character that is NOT a member of the set (including control characters!) except for NUL and LF. A character set can be described using ranges.

Examples of character sets:

    [abcd] - Matches either a, b, c or d.
    [0-9] - Matches any decimal digit.
    [^a-z] - Matches any character that is NOT a lower case alphabetic.

The match any character operand does just that: it matches any character. But it does not match the case of no character, NUL or LF.

Example of match any character:

    . - Matches any character.

Operators
---------

UNPOST regular expressions also contain operators. The operators that UNPOST recognizes are the alternation operator, the span operators, the concatenation operator and the enumeration operators.

The alternation operator has the lowest precedence of all the operators and its action is to attempt to match one of two alternatives.

Example of alternation:

    Airplane|dirigible - Matches either the string Airplane or the string dirigible.

The next higher precedence operator is the catenation operator. The catenation operator specifies that both the left and right hand regular expressions must match. The catenation operator does not have a special character; it is assumed to exist between two different operands that have no other operator between them.

Example of catenation:

    [Aa]irplane - Matches either an 'A' or an 'a' followed by the string irplane. This is a catenation of the two regular expressions [Aa] and irplane.

The next higher precedence operator is the enumeration operator. The enumeration operator specifies how many instances of a regular expression must be matched.

Examples of enumeration:

    abc* - Matches zero or more occurrences of the string abc.
    [A-Z]+ - Matches one or more occurrences of an upper case alphabetic character.
    [ ]? - Matches zero or one occurrences of the space character.
    very{1} - Matches one or more occurrences of the string very.
    b{1,3} - Matches a minimum of one to a maximum of three occurrences of the string b.

An enumeration operator attempts to match the largest source substring possible, except in the case of the . (match any character) followed by an enumeration operator. In this case, the smallest possible substring is matched.

The precedence of the operators can be modified with the use of parentheses. Parentheses have another meaning as well, described below.

Example of parenthesis use:

    Death( defying|wish) - Will match either the string 'Death defying' or the string 'Deathwish'. Without the parentheses, the regular expression would match either the string 'Death defying' or the string 'wish'.

Sub Expressions
---------------

UNPOST regular expressions are used primarily for identifying a particular line and extracting substrings from that line. To this end, UNPOST regular expressions support sub-expression marking. Sub-expressions are marked by parentheses. To determine the sub-expression number of a sub-expression, scan the regular expression from left to right, counting the number of left parentheses. Start with one; the count reached at a sub-expression's left parenthesis is its sub-expression number.

Example:

    .*((abcd)((0-9)+/(0-9)+))

    Sub-expression ((abcd)((0-9)+/(0-9)+)) is sub-expression #1.
    Sub-expression (abcd) is #2.
    Sub-expression ((0-9)+/(0-9)+) is #3.
    Sub-expression (0-9)+ is #4.
    Sub-expression (0-9)+ is #5.

Anchoring
---------

Normally, a regular expression will match a substring anywhere in the source string. If you want to specify that the matching substring must start at the beginning of the source string, you may use a caret character as the first character of the regular expression. This anchors the regular expression match to the start of the line.

To anchor a regular expression to the end of a line, use the dollar sign character. This effectively matches the end of line or end of string character.
Anchor operators have a higher precedence than alternation, but lower than catenation.

Configuration:

OK, here's how to configure UNPOST to work for you.

UNPOST relies heavily on regular expressions. These regular expressions may not be correct for your news reader, or system. There are five classes of regular expressions:

    1) The SEGMENT begin line regular expression.
    2) The ID line prefix regular expression.
    3) The ID line with part description regular expression list.
    4) The begin line regular expression.
    5) The end line regular expression.

Of these five, I don't expect you to have to modify the regular expressions for handling begin and end lines, because they should be correct for all uuencoders that follow the standard format.

Be aware that UNPOST has a hierarchy of regular expressions. Each SEGMENT begin line regular expression has underneath it two lists of regular expressions that recognize ID line prefixes, and each element in the list of ID line prefix regular expressions has a list under it that attempts to parse the ID line. The two lists immediately under the SEGMENT begin line regular expression are for 1) the header and 2) the body.

The ID line prefix regular expression exists for the sake of efficiency. It is used to find an ID line before we attempt to parse it. Modify or add one of these if you wish to change whether or not a line is recognized by UNPOST as being an ID line. If you modify this, you must modify the list of segment description regular expressions to match.

The SEGMENT begin line regular expressions are used to find the beginning of a SEGMENT, or the end of a previous segment. Modify these to change the line or lines that UNPOST recognizes as the beginning of a segment.
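The hierarchy described above can be pictured as nested, ordered lists. The Python sketch below is one hypothetical way to model it and is not UNPOST's actual data structure. The patterns are borrowed from the sample configuration file shown later in this document and translated to Python's `re` dialect, where `.*?` is written lazily because UNPOST matches the shortest substring for a `.` followed by an enumeration operator, while Python's `.*` is greedy.

```python
import re

# Each entry pairs a cheap ID line prefix pattern with an ordered list of
# parse patterns. All names here are illustrative assumptions.
HEADER_RULES = [
    (re.compile(r"^Subject:"), [
        re.compile(r"^Subject:(.*?)part[ ]+([0-9]+)[ ]*(of|/)[ ]*([0-9]+)",
                   re.IGNORECASE),
        re.compile(r"^Subject:(.*?)([0-9]+)[ ]*(of|/)[ ]*([0-9]+)",
                   re.IGNORECASE),
    ]),
]
BODY_RULES = []  # empty, as in the sample configuration


def parse_id_line(line, rules):
    """Return (id, segment number, segments) from the first parser that hits."""
    for prefix, parsers in rules:
        if not prefix.search(line):
            continue  # not an ID line for this prefix, skip its parsers
        for parser in parsers:  # searched in order, top to bottom
            m = parser.search(line)
            if m:
                return m.group(1).strip(), int(m.group(2)), int(m.group(4))
    return None


print(parse_id_line("Subject: ship.gif part 2 of 3", HEADER_RULES))
print(parse_id_line("Lines: 761", HEADER_RULES))
```

The prefix test mirrors the efficiency argument above: most lines fail the short prefix pattern and never reach the more expensive parse patterns.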
If you get an error message that indicates that the Subject line has no identifiable part description, and you see that some bright boy/girl has come up with a brand new part description format, then you have two choices: modify the source and hope they don't post again, or add a new ID line regular expression to the list of ID line regular expressions in the segment.c source file.

Be aware that the lists of regular expressions are searched in order from top to bottom to find a match. This means that less specific regular expressions should be placed later in the list. For example, the regular expression '$(0-9)+/(0-9)+$' should come before the regular expression '(0-9)+ (0-9)+' in the part syntax parsing regular expression list. This reduces the number of misparses that occur.

Remember that C uses the backslash (\) as an escape character in strings, so to put a backslash into a regular expression you need to put two into the C source string.

All regular expressions can be found at the top of the parse.c source file.

Before you modify the actual source code and recompile, I strongly suggest that you test your new regular expression using the regular expression test harness (retest) that was compiled by the makefile when you compiled UNPOST. Then, when you are sure that it is correct, copy the def.cfg file to a new name, make your changes there and use that configuration file for a while. If after all this, you are sure that it works, go in and change the source code in parse.c.

Before you add or modify a regular expression, you have to know the syntax of the regular expressions used in this program. The syntax is very similar to that used by UN*X style regular expressions, but is not exactly the same. See the section titled Regular Expressions before attempting to configure UNPOST.

Configuration Files:

If you don't want to make permanent changes to UNPOST's configuration, you can make a configuration file.
Configuration files are parsed by UNPOST, the regular expressions compiled, and these regular expressions control the operation of UNPOST completely.

A configuration file must have the following syntax:

    segment "..."
    {
        header
        {
            "...."
            {
                "..."
                {
                    id #
                    segment number #
                    segments #
                    alternate id #
                    case ignore|sensitive
                }
            }
        }
        body
        {
            "...."
            {
                "..."
                {
                    id #
                    segment number #
                    segments #
                    alternate id #
                    case ignore|sensitive
                }
            }
        }
    }

Where "..." is a regular expression string, # is a sub expression number (see the section on regular expressions), and case selects whether case is ignored in regular expression string matching or string matching is case sensitive.

The outermost construct, starting with the segment "..." line, is used to describe how to recognize the beginning of a segment.

The two constructs at the first level within the segment construct are used to identify lines that are expected to contain part # of # of parts information in the header, or the body of an article.

Within each header or body group are regular expressions that are used to parse out the part # of # of parts information from an identified information line.

A very simple example (taken directly out of the MUFUD documentation):

    segment "^Article[:]?"
    {
        header
        {
            "^Subject:"
            {
                "^Subject:(.*)part[ ]+([0-9]+)[ ]*(of|/)[ ]*([0-9]+)(.*)"
                {
                    id 1
                    segment number 2
                    segments 4
                    alternate id 5
                    case ignore
                }
                "^Subject:(.*)([0-9]+)[ ]*(of|/)[ ]*([0-9]+)(.*)"
                {
                    id 1
                    segment number 2
                    segments 4
                    alternate id 5
                    case ignore
                }
            }
        }
        body
        {
        }
    }

Where:

    id 1
        Specifies the sub expression number of the sub expression that is used to extract the binary ID string.

    segment number 2
        Specifies the sub expression number of the sub expression that is used to extract the segment number.

    segments 4
        Specifies the sub expression number of the sub expression that is used to extract the number of segments that this binary was split into. (Sub expression 3 is the (of|/) separator.)

    alternate id 5
        Specifies the alternate sub expression number.
        If the first ID sub expression extracts only an empty string (or one with all white space), the string extracted by this sub expression is used to generate the binary ID string.

    case ignore
        Specifies that the case of alphabetical characters is to be ignored in regular expression string matching.

See the def.cfg file for another (more complete) example.

Default Binary Switch Settings:

To modify the default values of the binary switches, edit the file compiler.h, and change the value of the defines. There are defines for file name mungling, breaking incompletes into separate files and for dumping out description files.

Bugs:

This program has been pretty extensively tested in interpretation mode, and it appears to be both robust and flexible. Unfortunately, about once a week, somebody comes up with a new and unusual way to encode the parts description on the Subject line.

Bugs, after being found, are chased unmercifully and terminated with extreme prejudice. If you think you've found one, send all information, opinions, prejudices and criticisms to me, and the hunt will begin (just as soon as I can put on my safari hat and grab my debugger. . .).

Rights, Copyright, Legal stuff, etc:

This program is distributed free of charge, but it has NOT been placed in the public domain! I retain copyright. Why? Because I am pathologically committed to producing and maintaining a quality product, and if every Tom, Dick and Susan modifies UNPOST and redistributes, I will not be able to respond to bug reports or continue to upgrade the product (branch revision problem, and all that. . .). My job isn't done so long as a single bug exists, or even one user is unhappy (or even one system is uninfected. . . er, that is to say, supported :-).

However, I am also dedicated to the principle of maximum use. If you wish, you may modify anything in this program you want to. That's why I distribute source.
BUT, you may NOT distribute your changes, unless you can be legally sure that you have made so many as to make what you distribute a new work.

If you learn any bad habits from reading my source code, tough luck. And if anything in this section is not legally supportable, the joke's on me. Don't bother telling me, I'm too busy coding (So THERE!).

Author:

John W. M. Stevens - jstevens@csn.org