3.4 Regex/Formatting & Tokenizing

In our last lesson of the section we look at regular expressions and how we can use regular expression patterns for matching data. After this we look at formatting and tokenizing our data.

Let's take a look at the points outlined on the Oracle website for this part of the certification.

  • Section 3: API Contents

    • Write code that uses standard J2SE APIs in the java.util and java.util.regex packages to format or parse strings or streams. For strings, write code that uses the Pattern and Matcher classes and the String.split method. Recognize and use regular expression patterns for matching (limited to: . (dot), * (star), + (plus), ?, \d, \s, \w, [], ()). The use of *, +, and ? will be limited to greedy quantifiers, and the parenthesis operator will only be used as a grouping mechanism, not for capturing content during matching. For streams, write code using the Formatter and Scanner classes and the PrintWriter.format/printf methods. Recognize and use formatting parameters (limited to: %b, %c, %d, %f, %s) in format strings.

Regular Expressions

A regular expression is a string containing normal characters as well as metacharacters, which together make a pattern we can use to match data. The metacharacters represent concepts such as positioning, quantity and character types. Searching through data for specific characters or groups of characters is known as pattern matching and is generally done from left to right through the input character sequence.

Regular expressions are a large topic, but here we will just cover the parts you need to know for certification. For a full list of regex constructs visit the Oracle online documentation for the Java™ 2 Platform Standard Edition 5.0 API Specification, scroll down the top left pane and click on java.util.regex.

Metacharacter   Meaning and examples

Escape/Unescape

\     Gives a special meaning to a character that is otherwise treated literally, or removes the special meaning from a metacharacter so that it is treated literally. In a Java string literal the backslash itself must be doubled.
      Literal content:
        d matches the character d
        \\d matches a digit character
      Unescaping special characters:
        d+ matches one or more d characters
        d\\+ matches the literal text d+

Quantifiers

?     Matches the preceding item 0 or 1 times.
      Example: do?
        Every dig has its day    matches: d, d
        Every dog has its day    matches: do, d
        Shut that doooor         matches: do
        Can you see me           no match

*     Matches the preceding item 0 or more times.
      Example: do*
        Every dig has its day    matches: d, d
        Every dog has its day    matches: do, d
        Shut that doooor         matches: doooo
        Can you see me           no match

+     Matches the preceding item 1 or more times.
      Example: do+
        Every dig has its day    no match
        Every dog has its day    matches: do
        Shut that doooor         matches: doooo
        Can you see me           no match

Predefined Character Classes

.     Matches any single character except line terminators, unless the DOTALL flag is specified.
      Example: .t
        This Time tonight        matches: " t", "ht"
        this is good             no match

\d    Matches a digit character. Same as the range check [0-9].
      Example: \\d
        Was it 76 or 77          matches: 7, 6, 7, 7

\s    Matches a whitespace character.
      Example: \\s
        Beware of the dog        matches: each of the three spaces

\w    Matches a word character. A word character is a character in the ranges a-z, A-Z, 0-9 and also includes the _ (underscore) symbol. Same as the range check [A-Za-z0-9_].
      Example: \\w
        76% off_sales. £12 only  matches: every character except %, ., £ and the spaces

See the Regular Expressions lesson for more code examples and usage of regular expressions.
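
As a minimal sketch of the greedy quantifiers and predefined character classes described above, the following uses the Pattern and Matcher classes to collect every match in an input string. The class name RegexMatchDemo and the helper findAll() are our own illustrative names, not part of the Java API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexMatchDemo {

    // Hypothetical helper: returns every non-overlapping match of regex in input,
    // found by scanning from left to right.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<String>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // ? : 'd' followed by zero or one 'o'
        System.out.println(findAll("do?", "Every dog has its day"));   // [do, d]
        // * : 'd' followed by zero or more 'o' (greedy, takes as many as it can)
        System.out.println(findAll("do*", "Shut that doooor"));        // [doooo]
        // + : 'd' followed by at least one 'o'
        System.out.println(findAll("do+", "Every dig has its day"));   // []
        // \d is written "\\d" inside a Java string literal
        System.out.println(findAll("\\d", "Was it 76 or 77"));         // [7, 6, 7, 7]
        // \+ removes the special meaning of '+', so this matches the text "d+"
        System.out.println(findAll("d\\+", "d+ and dd"));              // [d+]
    }
}
```

Because the quantifiers are greedy, do* consumes all four o characters in "doooor" rather than stopping at the first.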

Formatting

In this part of the lesson we look at formatting our output; Java offers us several options for doing this. We will look at formatting data using the java.util.Formatter class as well as the static format() method of the java.lang.String class. We finish off our look at formatting output with the printf() method contained in the java.io.PrintStream and java.io.PrintWriter classes.

Formatting Overview

Producing formatted output requires a format string and an argument list. The format string may contain fixed text as well as one or more embedded format specifiers, which are applied to the arguments in the argument list to produce the formatted output as a String object. The argument list may be empty when none of the format specifiers needs an argument.

Format specifiers which do not correspond to an argument, such as %n (line separator) and %% (literal percent sign), have the following syntax:


// Format specifier syntax without a corresponding argument
%[flags][width]conversion
  • The optional flags is a set of characters that modify the output format where the set of valid flags depends on the conversion.
  • The optional width is a non-negative decimal integer indicating the minimum number of characters to be written to the output.
  • The required conversion is a character indicating content to be inserted in the output.

Format specifiers used to represent date and time types have the following syntax:


// Format specifier syntax with argument list for date and time types
%[argument_index$][flags][width]conversion
  • The optional argument_index is a decimal integer indicating the position of the argument in the argument list. The first argument is referenced by "1$", the second by "2$" and so on.
  • The optional flags and width are defined as above.
  • With dates the required conversion is a two-character sequence where the first character is 't' or 'T' and the second character indicates the format to be used.

Format specifiers for general, character, and numeric types have the following syntax:


// Format specifier syntax with argument list for general, character, and numeric types
%[argument_index$][flags][width][.precision]conversion
  • The optional argument_index, flags and width are defined as above.
  • The optional precision is a non-negative decimal integer generally used to restrict the number of characters but specific behavior depends on the conversion.
  • The required conversion is a character indicating how the argument should be formatted, where the set of valid conversions for a given argument depends on the argument's data type.
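
The parts of a format specifier can be tried out with a short sketch like the one below, which exercises argument_index, flags, width and precision in turn. The class and method names are hypothetical, and Locale.ENGLISH is passed only so that the output does not vary with the machine's default locale.

```java
import java.util.Locale;

class FormatSpecifierDemo {

    // argument_index: 2$ selects the second argument, 1$ the first.
    static String reorder(String first, String second) {
        return String.format(Locale.ENGLISH, "%2$s %1$s", first, second);
    }

    // width 8 pads to at least 8 characters; the '-' flag left-justifies
    // and the '0' flag pads with leading zeros.
    static String padded(int value) {
        return String.format(Locale.ENGLISH, "[%8d] [%-8d] [%08d]", value, value, value);
    }

    // precision: digits after the decimal point for %f, maximum characters for %s.
    static String precise(double number, String text) {
        return String.format(Locale.ENGLISH, "%.2f %.3s", number, text);
    }

    // %% takes no argument and prints a literal percent sign.
    static String percent(int value) {
        return String.format(Locale.ENGLISH, "%d%%", value);
    }

    public static void main(String[] args) {
        System.out.println(reorder("world", "hello"));       // hello world
        System.out.println(padded(42));                      // [      42] [42      ] [00000042]
        System.out.println(precise(3.14159, "formatting"));  // 3.14 for
        System.out.println(percent(50));                     // 50%
    }
}
```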

The table below lists some conversions with their descriptions. You can find the complete list of flags and conversions in the API documentation for the java.util.Formatter class.

Conversion   Description
b            Formats boolean true or false
c            Formats as a Unicode character
d            Formats as a decimal integer
f            Formats the argument as a floating point decimal
o            Formats as an octal integer
s            Formats the argument as a string
x            Formats as a hexadecimal integer

Date and time conversions (second character, used after 't' or 'T')
A            Locale-specific full name of the day of the week, "Monday", "Tuesday"...
B            Locale-specific full month name, "January", "February"...
Y            Year formatted as at least four digits, with leading zeros for years less than 1000
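
To see the conversions side by side, here is a small sketch that applies each one through String.format(). The class name ConversionDemo and its methods are illustrative only, and the date used in main() is an arbitrary value chosen for the example.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

class ConversionDemo {

    // The five conversions named in the certification objective.
    static String basics() {
        return String.format(Locale.ENGLISH, "%b %c %d %f %s", true, 'J', 255, 1.5, "text");
    }

    // The same integer rendered in octal and in hexadecimal.
    static String radix(int value) {
        return String.format("%o %x", value, value);
    }

    // Date conversions are two characters: 't' followed by A, B or Y.
    // 1$ reuses the single Calendar argument for all three specifiers.
    static String date(Calendar calendar) {
        return String.format(Locale.ENGLISH, "%1$tA %1$tB %1$tY", calendar);
    }

    public static void main(String[] args) {
        System.out.println(basics());      // true J 255 1.500000 text
        System.out.println(radix(255));    // 377 ff
        System.out.println(date(new GregorianCalendar(2012, Calendar.MARCH, 5)));
                                           // Monday March 2012
    }
}
```

Note that %f defaults to six digits after the decimal point when no precision is given.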

The java.util.Formatter Class

The java.util.Formatter class lets us format output to a variety of destinations, which are selected through its wide range of constructors. The API documentation is extremely detailed and we are just showing an example so you get the idea:

See java.util.Formatter for code examples and usage.
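
As a rough illustration, the sketch below uses the Formatter(Appendable, Locale) constructor to append formatted text to a StringBuilder. The class name FormatterDemo, the describe() method and its parameters are assumptions made for this example.

```java
import java.util.Formatter;
import java.util.Locale;

class FormatterDemo {

    // Builds a line such as "pen x 3 = 3.75" using two successive format() calls
    // that both write to the same StringBuilder destination.
    static String describe(String item, int quantity, double price) {
        StringBuilder destination = new StringBuilder();
        Formatter formatter = new Formatter(destination, Locale.ENGLISH);
        formatter.format("%s x %d", item, quantity);
        formatter.format(" = %.2f", quantity * price);
        formatter.close();   // releases the formatter; the StringBuilder keeps its content
        return destination.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("pen", 3, 1.25));   // pen x 3 = 3.75
    }
}
```

Other constructors accept destinations such as a file name, an OutputStream or a PrintStream, so the same format() calls can target different outputs.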

The String.format() Method

The String.format() static method allows us to format an output string and is overloaded to accept a format string and argument list or a locale, format string and argument list. In our example we will use the second overloaded method which accepts a locale, format string and argument list:

See String.format() for code examples and usage.
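
A minimal sketch of the overload that accepts a locale, format string and argument list might look like this. The class name StringFormatDemo and the price() helper are hypothetical, and the ',' grouping flag is used here only to make the effect of the locale visible.

```java
import java.util.Locale;

class StringFormatDemo {

    // Formats amount with grouping separators and two decimal places,
    // using the separators of the given locale.
    static String price(Locale locale, double amount) {
        return String.format(locale, "%,.2f", amount);
    }

    public static void main(String[] args) {
        System.out.println(price(Locale.US, 1234567.891));        // 1,234,567.89
        System.out.println(price(Locale.GERMANY, 1234567.891));   // 1.234.567,89
    }
}
```

The first overload, without a locale, behaves the same way but uses the default locale of the running JVM.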

The printf() Method

The printf() method allows us to format output to a java.io.PrintStream or java.io.PrintWriter stream. These classes also contain a method called format() which produces the same results, so whatever you read here for the printf() method can also be applied to the format() method. For our example we will use the printf() method from the PrintStream class. If you remember from the Java I/O Overview lesson, System.out is of type PrintStream and so will be used for convenience:

See printf() for code examples and usage.
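
The following sketch shows printf() on System.out and, in the hypothetical render() helper, printf() and format() on a PrintWriter wrapped around a StringWriter so that the result can be inspected as a string. The names PrintfDemo and render() are our own.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

class PrintfDemo {

    // Writes the same line twice, once with printf() and once with format(),
    // to show that the two methods behave identically.
    static String render(String name, int score) {
        StringWriter buffer = new StringWriter();
        PrintWriter out = new PrintWriter(buffer);
        out.printf("%s scored %d%n", name, score);   // %n is the platform line separator
        out.format("%s scored %d", name, score);
        out.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        // System.out is a PrintStream, so printf() is available directly.
        System.out.printf("%s scored %d%n", "Ann", 90);
        System.out.println(render("Bob", 85));
    }
}
```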

Tokenizing

We finish off our tour of the Java API by looking at tokenizing our data. For this we will first look at the split() method of the String class, which uses a regular expression delimiter to tokenize our data. After this we look at the java.util.Scanner class; objects of this class allow us to break input into tokens using a delimiter pattern which defaults to whitespace or can be set using a regular expression.

The split() Method

The split() method will split a string around matches of the given regular expression, returning the results in a String array. The split() method is overloaded and will accept a regex string and a limit argument of type int denoting the number of times the pattern is to be applied. The second form just requires a regex string and in this form it is the same as invoking the split() method with the limit set to zero. An explanation of how values passed to the limit parameter affect the number of times the pattern is to be applied follows:

  • limit < 0
    Pattern will be applied as many times as possible, output array can have any length and trailing empty strings are retained.
  • limit = 0
    Pattern will be applied as many times as possible, output array can have any length and trailing empty strings are discarded.
  • limit > 0
    Pattern will be applied at most limit - 1 times, output array length will be no greater than limit and the last entry will contain all input beyond the last matched delimiter.

See the split() method for code examples.
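
The effect of the three kinds of limit value can be checked with a short sketch such as this one. The class name SplitDemo and the show() helper are illustrative, and the input string was chosen so that it ends with two empty tokens.

```java
import java.util.Arrays;

class SplitDemo {

    // Splits input around regex using the given limit and renders the array as text.
    static String show(String input, String regex, int limit) {
        return Arrays.toString(input.split(regex, limit));
    }

    public static void main(String[] args) {
        String input = "a,b,c,,";
        System.out.println(show(input, ",", -1));   // [a, b, c, , ]  trailing empty strings kept
        System.out.println(show(input, ",", 0));    // [a, b, c]      trailing empty strings discarded
        System.out.println(show(input, ",", 2));    // [a, b,c,,]     pattern applied once only
        // The single-argument form behaves like a limit of zero.
        System.out.println(Arrays.toString(input.split(",")));   // [a, b, c]
    }
}
```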

The java.util.Scanner Class

The java.util.Scanner class is a simple text scanner which allows us to parse primitive data types and strings using regular expressions. Objects of this class allow us to break input into tokens using a delimiter pattern. The resulting tokens can then be converted into values of different types using one of the next methods available in the java.util.Scanner class, for example nextInt() or nextDouble(). In our example we show how to use the Scanner class with the default delimiter of whitespace and also with a delimiter created using a regular expression.

See the java.util.Scanner class for code examples.
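
A minimal sketch of both approaches follows: sumInts() relies on the default whitespace delimiter and tokens() sets a delimiter from a regular expression with useDelimiter(). The class name ScannerDemo and both method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerDemo {

    // Default delimiter (whitespace): adds up the integer tokens and skips the rest.
    static int sumInts(String input) {
        Scanner scanner = new Scanner(input);
        int total = 0;
        while (scanner.hasNext()) {
            if (scanner.hasNextInt()) {
                total += scanner.nextInt();
            } else {
                scanner.next();   // consume and ignore a non-integer token
            }
        }
        scanner.close();
        return total;
    }

    // Custom delimiter: tokens are separated by whatever matches delimiterRegex.
    static List<String> tokens(String input, String delimiterRegex) {
        Scanner scanner = new Scanner(input).useDelimiter(delimiterRegex);
        List<String> result = new ArrayList<String>();
        while (scanner.hasNext()) {
            result.add(scanner.next());
        }
        scanner.close();
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sumInts("10 apples 20 pears 12"));        // 42
        // A comma with optional surrounding whitespace acts as the delimiter.
        System.out.println(tokens("red, green,blue", "\\s*,\\s*"));  // [red, green, blue]
    }
}
```

Checking hasNextInt() before calling nextInt() avoids the InputMismatchException that nextInt() throws when the next token is not an integer.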

Related Java6 Tutorials

API Contents - Regular Expressions
API Contents - Formatting & Tokenizing
