Copyright ©1996, Que Corporation. All rights reserved. No part of this book may be used or reproduced in any form or by any means, or stored in a database or retrieval system without prior written permission of the publisher except in the case of brief quotations embodied in critical articles and reviews. Making copies of any part of this book for any purpose other than your own personal use is a violation of United States copyright laws. For information, address Que Corporation, 201 West 103rd Street, Indianapolis, IN 46290 or at support@mcp.com.

Notice: This material is excerpted from Special Edition Using Java, ISBN: 0-7897-0604-0. The electronic version of this material has not been through the final proofreading stage that the book goes through before being published in printed form. Some errors may exist here that are corrected before the book is published. This material is provided "as is" without any warranty of any kind.

Chapter 8 - Tokens in Depth

by Jay Cross

Tokens are to computer languages as words and punctuation are to human languages. William of Ockham (a noted 14th-century scholar famous for his support of simplicity) went to great lengths in his Summa Logicae to describe his theory of terms. While it is beyond the scope of this book to explain Ockham fully, the sense of it is that (for example) the word "chair" is not a chair but rather a symbol: a reader or listener conjures up the thought of a chair when he reads or hears the word.

Using the same analogy, tokens are terms in source languages for computers. If a programmer declares a token "counter" to represent a short integer (a 16-bit number described later in this chapter), then the compiler recognizes the token "counter" every time it is used in that context as referring to a specific 16 bits of memory somewhere. Any operations performed on "counter" are done with the value contained in those 16 bits: not with the token itself (the characters c, o, u, n, t, e, r), but with what that token represents to the compiler.
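As a minimal sketch of this idea (the class and method names are hypothetical, chosen only for illustration), the token counter below names a 16-bit short, and every operation acts on the stored value rather than on the characters of the name:

```java
class CounterDemo {
    // Increments a short the given number of times and returns the result.
    static short bump(short start, int times) {
        short counter = start;      // "counter" now stands for 16 bits of storage
        for (int i = 0; i < times; i++) {
            counter++;              // changes the stored value, not the name
        }
        return counter;
    }

    public static void main(String[] args) {
        System.out.println(bump((short) 0, 3));
        // Only 16 bits are available, so stepping past the largest short wraps around.
        System.out.println(bump(Short.MAX_VALUE, 1));
    }
}
```

The second call shows the 16-bit limit at work: adding one to the largest short value yields the smallest one.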

To accurately describe a task to a compiler, a description language needs to have a strict and unambiguous grammar structure. Java's grammar is fairly simple and elegant. You can begin understanding Java by learning about the tokens from which the more complex forms of expression are composed. These include keywords, identifiers, literals, separators, and operators. A Java program may also contain white space and comments that have no meaning to the compiler but are permitted for the sake of making the code's meaning clear to human readers, especially its author(s).
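The short sketch below labels each kind of token in a few ordinary statements. The class, method, and variable names are hypothetical; only the categories come from the text above.

```java
class TokenKinds {
    static int area() {
        // This comment and the blank line below are ignored by the compiler.

        int width = 6;   // keyword int, identifier width, operator =, literal 6, separator ;
        int height = 7;
        return width * height;   // keyword return, two identifiers, operator *
    }

    public static void main(String[] args) {
        System.out.println(area());
    }
}
```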

In this chapter you will learn:

Keywords

There are certain sequences of characters that have special meaning in Java; these sequences are called keywords. Some of them are like verbs, some like adjectives, some like pronouns. Some of them are tokens that are reserved for later versions of the language, and one, goto, is a vile oath from ancient procedural tongues that may never be uttered in polite Java.

The following is a list of the 56 keywords you can use in Java. When you know the meanings of all these terms, you will be well on your way to being a Java programmer.

Table 9.1 The 56 Keywords Used in Java

abstract      boolean       break         byte
case          cast          catch         char
class         const         continue      default
do            double        else          extends
final         finally       float         for
future        generic       goto          if
implements    import        inner         instanceof
int           interface     long          native
new           null          operator      outer
package       private       protected     public
rest          return        short         static
super         switch        synchronized  this
throw         throws        transient     try
var           void          volatile      while

The keywords byvalue, cast, const, future, generic, goto, inner, operator, outer, rest, and var are reserved, but have no meaning in Java 1.0. Programmers experienced with other languages such as C, C++, Pascal, or SQL may know what these terms might eventually be used for. For the time being, you won't use these terms, and Java is much simpler and easier to maintain without them.

The tokens true and false are not on this list; technically, they are literal values for boolean variables or constants (boolean and other literals are described in the section on literals later in this chapter). As such, programmers should refrain from using them as identifiers (user defined names or labels).

Because these terms have specific meaning in Java, you can't use them as identifiers for something else, such as variables, constants, class names, and so on. However, they can be used as part of a longer token, for example:

public int abstract_int;

Also, because Java is case sensitive, a programmer who is bent on using one of these words as an identifier can do so by capitalizing its first letter. While this is possible, it is a very bad idea in terms of human readability, and it results in wasted man-hours when the code must be maintained later. For example, the following declaration compiles:

public short Long;

It can be done, but for the sake of clarity and mankind's future condition, please don't do it.
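To make the cost concrete, here is a small sketch built around the two declarations shown above. The class name KeywordNames and the method largest are hypothetical. Inside this class the simple name Long refers to the short field, so the standard class of the same name can be reached only through its fully qualified name:

```java
class KeywordNames {
    // Legal: the keyword "abstract" is only part of a longer identifier.
    public int abstract_int = 1;

    // Legal but confusing: "Long" differs from the keyword "long" only by case.
    public short Long = 2;

    public long largest() {
        // A bare Long.MAX_VALUE would not compile here, because the simple name
        // Long now means the short field; the fully qualified name is required.
        return java.lang.Long.MAX_VALUE;
    }

    public static void main(String[] args) {
        KeywordNames k = new KeywordNames();
        System.out.println(k.abstract_int + k.Long);
        System.out.println(k.largest());
    }
}
```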

There are numerous classes defined in the standard packages. While their names are not keywords, reusing these names for your own identifiers may make your meaning unclear to those who later work on your application or applet.

Identifiers

Identifiers are terms chosen by the programmer that become tokens representing variables, constants, classes, objects, labels (which are like nouns), and methods (which are like verbs). As noted in the previous section, identifiers cannot be identical to Java keywords.

Identifiers in Java are a sequence of Unicode letters and digits of unlimited length. (Actually, the length may be limited by the maximum file size on the applet or application developer's system. Practically, this would limit an identifier to being less than two billion characters.) The first character of an identifier must be a letter. All subsequent characters must be letters or numerals. They do not need to be Latin letters or digits; they could be from any alphabet that Unicode supports, such as Arabic-Indic, Devanagari, Bengali, Tamil, Thai, or many others. For various historical and practical considerations, the underscore (_) and the dollar sign ($) are considered letters and may be used as any character in an identifier, including the first one.

Two tokens are the same identifier only if they are of equal length and if each character in the first token is exactly the same as its counterpart in the second token. This is case-sensitive and language-sensitive. This means that Latin letters are different from matching Greek letters, and letters with accents are different from letters without.
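A brief sketch of this rule follows; the class and variable names are hypothetical. The two local variables differ only in the case of one letter, so they are separate identifiers, and the character comparisons show that look-alike letters from different alphabets, or with and without accents, are distinct characters:

```java
class DistinctNames {
    static int sum() {
        int total = 1;
        int Total = 10;    // a different identifier: the case of the first letter differs
        return total + Total;
    }

    public static void main(String[] args) {
        System.out.println(sum());
        // Latin capital A and Greek capital alpha look alike but are different characters,
        // so identifiers built from them would be different identifiers.
        System.out.println('A' == '\u0391');
        // An accented letter is likewise distinct from its unaccented form.
        System.out.println('e' == '\u00e9');
    }
}
```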

Most application developers are forever walking the line of compromise between choosing identifiers that are short enough to be quickly and easily typed without error and those that are long enough to be descriptive and easily read. Either way, in a large application it is useful to choose a naming convention that reduces the likelihood of accidental reuse of a particular identifier.

Table 9.2 Examples of Legal and Illegal Identifiers

Legal identifiers                            Not legal identifiers
HelloWorld                                   9HelloWorld
counter                                      count&add
HotJava$                                     Hot Java
ioc_Queue3                                   65536
ErnestLawrenceThayersFamousPoemOfJune1888    non-plussed

In the illegal examples above, the first is forbidden because it begins with a numeral. The second has an illegal character (&) in it. The third also has an inappropriate character, the blank space. The fourth is a literal number (2^16) and cannot be used as an identifier. The last one contains yet another bad character, the hyphen or minus sign. Java would try to treat this last case as an expression containing two identifiers and an operation to be performed on them.
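The rules can be checked mechanically. The sketch below assumes a JDK newer than 1.0, because it relies on Character.isJavaIdentifierStart and Character.isJavaIdentifierPart, which were added to the standard library later. The class and method names are hypothetical. The check covers only the shape of a name: it does not reject keywords, and the library methods accept a few characters beyond plain letters and digits.

```java
class IdentifierCheck {
    // Returns true if s has the shape of a Java identifier (keywords are not checked).
    static boolean looksLikeIdentifier(String s) {
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;   // empty, or first character is not a letter, _ or $
        }
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;   // a later character is not a letter, digit, _ or $
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {
            "HelloWorld", "counter", "HotJava$", "ioc_Queue3",
            "9HelloWorld", "count&add", "Hot Java", "65536", "non-plussed"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + looksLikeIdentifier(s));
        }
    }
}
```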

Literals

Literals are tokens representing values to be stored in bytes, shorts, ints, longs, floats, doubles, booleans, and chars. In addition, literals are used to represent values to be stored in string types. The following statements contain literals:
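As an illustrative sketch (the class and field names are hypothetical), each declaration below contains one literal, with one declaration for each of the types just mentioned:

```java
class LiteralSamples {
    static byte    flags    = 7;            // integer literal stored in a byte
    static short   year     = 1888;         // integer literal stored in a short
    static int     limit    = 65536;        // integer literal
    static long    big      = 4000000000L;  // the L suffix marks a long literal
    static float   ratio    = 2.5f;         // the f suffix marks a float literal
    static double  avogadro = 6.02e23;      // floating point literal with an exponent
    static boolean ready    = true;         // boolean literal
    static char    grade    = 'A';          // character literal
    static String  name     = "HotJava";    // string literal

    public static void main(String[] args) {
        System.out.println(limit + " " + big + " " + grade + " " + name + " " + ready);
    }
}
```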

Clearly, there are several types of literals. In fact, the Java Language Specification gives five major types of literals, some of which have subtypes. The five major types are boolean, character, floating point, integer, and string literals.

The following five sections of this chapter give more information about the different types of literals.

Boolean Literals

There are two boolean literals: true and false. There is no null value, and there is no numeric equivalent.
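A small sketch of what this means in practice follows. The class and method names are hypothetical. Because a boolean has no numeric equivalent, any translation to a number has to be written out explicitly:

```java
class BooleanDemo {
    // Converts a boolean to 1 or 0 explicitly; Java performs no such conversion itself.
    static int toInt(boolean flag) {
        return flag ? 1 : 0;
    }

    public static void main(String[] args) {
        boolean done = false;
        // int n = (int) done;   // would not compile: a boolean cannot be cast to a number
        // if (1) { }            // would not compile: a condition must be a boolean
        System.out.println(toInt(done));
        System.out.println(toInt(true));
    }
}
```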

Character Literals

Character literals are enclosed in single quotes. This is true whether the character value is Latin alphanumeric, an escape sequence, or any other Unicode character. A single character can be any printable character except the single quote (') or backslash (\), which must be written as escape sequences. Some examples of these literals are 'a', 'A', '9', '+', '_', and '~'.

The escape sequence character literals are of the form '\b'. That is, within single quotes, a backslash followed by one of the following:

The meaning of the items from the first bulleted item above is probably familiar to C and C++ programmers, and anyone else should quickly recognize them as characters that need a special way to be represented:

Escape Literal Meaning
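As a sketch, assuming the standard set of Java character escapes, the program below prints the numeric Unicode value behind each escape literal. The class name, the helper method code, and the descriptive labels are hypothetical.

```java
class EscapeDemo {
    // Returns the numeric Unicode value of a character.
    static int code(char c) {
        return c;
    }

    public static void main(String[] args) {
        char[] escapes = {'\b', '\t', '\n', '\f', '\r', '\"', '\'', '\\'};
        String[] names = {
            "backspace", "horizontal tab", "line feed", "form feed",
            "carriage return", "double quote", "single quote", "backslash"
        };
        for (int i = 0; i < escapes.length; i++) {
            System.out.println(names[i] + " = " + code(escapes[i]));
        }
    }
}
```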

Character literals mentioned in the second bulleted item above are called octal escape literals. They can be used to represent any Unicode value from '\u0000' to '\u00ff' (the ISO Latin-1 range, whose lower half is traditional ASCII). In octal (base 8), these values are from \000 to \377. Note that octal numerals are from 0 to 7 inclusive. Some examples of these octal literals are:

Octal Literal Meaning
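A few octal escape literals can be checked with a short sketch like the one below. The class name and the helper method code are hypothetical, and the sample values were picked only for illustration.

```java
class OctalDemo {
    // Returns the numeric Unicode value of a character.
    static int code(char c) {
        return c;
    }

    public static void main(String[] args) {
        System.out.println(code('\101'));   // octal 101 = 64 + 1 = 65, the letter A
        System.out.println(code('\141'));   // octal 141 = 64 + 32 + 1 = 97, the letter a
        System.out.println(code('\60'));    // octal 60 = 48, the digit 0
        System.out.println(code('\377'));   // octal 377 = 192 + 56 + 7 = 255, the largest value
    }
}
```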

Character literals of the type in the last bulleted item above are interpreted very early by javac. As a result, using the escape Unicode literals to express a line termination character such as carriage return or line feed results in an end-of-line appearing before the terminal single quote mark. The result is a compile-time error. Examples of this type of character literal appear as the first six characters of each listing under the "Meaning" heading above.

Don't use the \u format to express an end-of-line character. Use the \n or \r characters instead.
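The sketch below shows the safe forms side by side; the class and constant names are hypothetical. The failing form is described in words rather than spelled out, because the compiler translates Unicode escapes everywhere in a source file, including inside comments.

```java
class UnicodeEscapeDemo {
    // Escape sequence for the line feed: always safe inside a character literal.
    static final char LINE_FEED = '\n';

    // Unicode escape for an ordinary printable character: also safe.
    static final char CAPITAL_A = '\u0041';

    // Writing the line feed as a backslash-u escape between the quotes would not
    // compile: the escape becomes a real line break before the literal is read.

    public static void main(String[] args) {
        System.out.println((int) LINE_FEED);
        System.out.println(CAPITAL_A);
    }
}
```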

Floating Point Literals

Floating point literals have several parts. They appear in the following order:

Part    Is it Required? Examples