UNIX Power Tools

Chapter 35: You Can't Quite Call This Editing

35.9 Splitting Files at Fixed Points: split

Most versions of UNIX come with a program called split, whose purpose is to split large files into smaller ones. That helps with tasks such as editing in an editor that cannot handle large files, or mailing files so big that some mailers refuse to deal with them. For example, let's say you have a really big text file that you want to mail to someone:

% ls -l bigfile
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile

Running split on that file will (by default, with most versions of split) break it up into pieces that are each no more than 1000 lines long:

% split bigfile
% ls -l
total 283
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
-rw-rw-r--  1 jik         46444 Oct 15 21:04 xaa
-rw-rw-r--  1 jik         51619 Oct 15 21:04 xab
-rw-rw-r--  1 jik         41007 Oct 15 21:04 xac
% wc -l x*
    1000 xaa
    1000 xab
     932 xac
    2932 total

Note the default naming scheme, which is to append "aa," "ab," "ac," etc., to the letter "x" for each subsequent filename. It is possible to modify the default behavior. For example, you can make it create files that are 1500 lines long instead of 1000:

% rm x??
% split -1500 bigfile
% ls -l
total 288
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
-rw-rw-r--  1 jik         74016 Oct 15 21:06 xaa
-rw-rw-r--  1 jik         65054 Oct 15 21:06 xab
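
The same split can be tried safely on a throwaway file. The sketch below is an illustration, not part of the original session: it builds a made-up 2932-line stand-in for bigfile in a scratch directory and uses the -l 1500 spelling of the line count, which POSIX specifies and which should behave like the older -1500 form on systems that accept both.

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for bigfile: 2932 numbered lines (made-up contents).
awk 'BEGIN { for (i = 1; i <= 2932; i++) print "line", i }' > bigfile

# -l 1500 asks for pieces of at most 1500 lines each.
split -l 1500 bigfile

# Every piece but the last holds exactly 1500 lines;
# the last holds the remainder (2932 - 1500 = 1432).
wc -l xaa xab
```

If your split rejects -l, check its manual page; it may be one of the older variants that only understands the -1500 form.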

You can also get it to use a name prefix other than "x":

% rm x??
% split -1500 bigfile bigfile.split.
% ls -l
total 288
-r--r--r--  1 jik        139070 Oct 15 21:02 bigfile
-rw-rw-r--  1 jik         74016 Oct 15 21:07 bigfile.split.aa
-rw-rw-r--  1 jik         65054 Oct 15 21:07 bigfile.split.ab
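
A descriptive prefix also makes the pieces easy to put back together. As a rough sketch (the sample file and the name rejoined are assumptions made for this illustration), the pieces can be concatenated with cat and checked against the original with cmp:

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for bigfile: 2932 numbered lines (made-up contents).
awk 'BEGIN { for (i = 1; i <= 2932; i++) print "line", i }' > bigfile

# Split into 1500-line pieces named bigfile.split.aa, bigfile.split.ab, ...
split -l 1500 bigfile bigfile.split.

# The shell expands the wildcard in sorted order (aa, ab, ...),
# so cat writes the pieces out in their original sequence.
cat bigfile.split.* > rejoined

# cmp is silent and returns success when the two files are identical.
cmp bigfile rejoined && echo "pieces rejoin cleanly"
```

Someone who receives the pieces by mail would run only the cat step, after saving each piece under its original name.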

Although the simple behavior described above tends to be relatively universal, there are differences in the functionality of split on different UNIX systems. There are four basic variants of split as shipped with various implementations of UNIX:

  1. A split that understands only how to deal with splitting text files into chunks of n lines or fewer each.

  2. A split, usually called bsplit, that understands only how to deal with splitting non-text files into n-character chunks. A public domain version of bsplit is available on the Power Tools disc.

  3. A split that will split text files into n-line chunks, or non-text files into n-character chunks, and tries to figure out automatically whether it's working on a text file or a non-text file.

  4. A split that will do either text files or non-text files, but needs to be told explicitly when it is working on a non-text file.

The only way to tell which version you've got is to read the manual page for it on your system, which will also tell you the exact syntax for using it.

The problem with the third variant is that although it tries to be smart and automatically do the right thing with both text and non-text files, it sometimes guesses wrong and splits a text file as a non-text file or vice versa, with completely unsatisfactory results. Therefore, if the variant on your system is (3), you probably want to get your hands on one of the many split clones out there that is closer to one of the other variants (see below).

Variants (1) and (2) listed above are OK as far as they go, but they aren't adequate if your environment provides only one of them rather than both. If you find yourself needing to split a non-text file when you have only a text split, or needing to split a text file when you have only bsplit, you need to get one of the clones that will perform the function you need.

Variant (4) is the most reliable and versatile of the four listed, and is therefore what you should go with if you find it necessary to get a clone and install it on your system. There are several such clones in the various source archives, including the freely available BSD UNIX version. Alternatively, if you have installed perl (37.1), it is quite easy to write a simple split clone in perl, and you don't have to worry about compiling a C program to do it; this is an especially significant advantage if you need to run your split on multiple architectures that would need separate binaries.
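
With a split of the fourth kind, the usual way to say "this is not a text file" is to give a size in bytes instead of a line count. POSIX specifies the -b option for this, but whether your split has it is something only your manual page can confirm. A minimal sketch, using a file of zero bytes as a stand-in for a real non-text file:

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for a non-text bigfile: 139070 zero bytes (made-up contents).
dd if=/dev/zero of=bigfile bs=139070 count=1 2>/dev/null

# -b 20000 asks for pieces of at most 20000 bytes each,
# with no attention paid to line boundaries.
split -b 20000 bigfile

# Expect six 20000-byte pieces (xaa to xaf) and a 19070-byte xag.
ls -l x??
```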

If you need to split a non-text file and don't feel like going to all of the trouble of finding a split clone that handles them, one standard UNIX tool you can use to do the splitting is dd (35.6). For example, if bigfile above were a non-text file and you wanted to split it into 20,000-byte pieces, you could do something like this:

$ ls -l bigfile
-r--r--r--  1 jik        139070 Oct 23 08:58 bigfile
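$ for i in 1 2 3 4 5 6 7   # 139070 / 20000 = 6, remainder 19070, so 7 pieces
> do
>       dd of=x$i bs=20000 count=1 2>/dev/null  # bs is the piece size; 2>/dev/null hides dd's status report
> done < bigfile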
> done < bigfile
$ ls -l
total 279
-r--r--r--  1 jik        139070 Oct 23 08:58 bigfile
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x1
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x2
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x3
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x4
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x5
-rw-rw-r--  1 jik         20000 Oct 23 09:00 x6
-rw-rw-r--  1 jik         19070 Oct 23 09:00 x7
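
Typing out the list of piece numbers by hand gets tedious for larger files. One possible variation, sketched here under the assumption of a POSIX shell with $(( )) arithmetic, works out the number of pieces from the file size. The variable names size, chunk, and pieces are made up for this example, and the file of zero bytes again stands in for a real non-text file.

```shell
# Work in a scratch directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for a non-text bigfile: 139070 zero bytes (made-up contents).
dd if=/dev/zero of=bigfile bs=139070 count=1 2>/dev/null

size=$(wc -c < bigfile)    # total size in bytes
chunk=20000                # wanted size of each piece

# Divide and round up, so a partial last piece still gets counted.
pieces=$(( (size + chunk - 1) / chunk ))

# All the dd commands share the loop's standard input, so each one
# picks up reading where the previous one stopped.
i=1
while [ "$i" -le "$pieces" ]
do
    dd of="x$i" bs="$chunk" count=1 2>/dev/null
    i=$(( i + 1 ))
done < bigfile

ls -l x*
```

Note that with ten or more pieces the names x10, x11, and so on sort before x2, so a plain cat x* would rejoin them in the wrong order; list the pieces explicitly in that case.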

- JIK

