Java - Arrays and strings Java - Arrays and strings [ Arrays and strings | Histogram program (v1) | Histogram program (v2) | References | Histogram program (v3) | Histogram program (v4) ] Arrays and strings Arrays and strings are related concepts. Arrays An array is simply a sequence of memory locations for storing data set out in such a way that any one of the individual locations can be accessed by quoting its index number. In Java, as in many programming languages, the index numbers start at zero. The individual items within an array are called elements. Unlike some programming languages, in Java arrays have to be specifically created (or declared) and are of a definite size. It is not possible to change the size of an array once you have declared its size. You can, of course, copy the contents to another, larger, array. Here's a simple example of array declaration in Java. double[] values = new double[25]; This is easier to understand if it is written in the alternative format below. double[] values; values = new double[25]; The first line is the declaration of the identifier values and is similar to previous declarations except for the square brackets ([]) after the type identifier. The square brackets, as you've probably guessed, mean that values is going to be an array of double rather than a single instance of a double. The second line uses the new operator to create the array and the operator needs to know both the type and the size of the array. The type and size information together enable the JVM to determine how much memory is required. The new operator locates a suitably sized block of memory and associates a reference to the memory block with the variable values. Java arrays are objects and, as such, there are various methods associated with the java.util.Arrays class for manipulating them. These include sorting, searching and filling methods. The java.lang.System class provides an array copying method and all arrays have a 'field' or property called length which defines the number of elements in the array. When referencing an array element, the index must be one of the primitive integer types other than long. The exclusion of long is deliberate, the implication being that Java arrays cannot have more than 2,147,483,647 elements. (The history of computing is littered with the consequences of such restrictive decisions). If the value of an index is negative or greater than the array length then an ArrayIndexOutOfBoundsException is thrown. This is a welcome improvement over the behaviour of C/C++ programs which simply interpreted such array bound violations as an access to some other unrelated part of memory with usually disastrous and obscure consequences. In the jargon, Java programs perform array bound violation checking and, not surprisingly, this comes at a cost in performance as every index value has to be checked. Strings A string is simply any sequence of characters delimited with double quotes. Java has a (non-primitive) data type String. Of course a non-primitive data type is really a class and there are a number of methods associated with the java.lang.String class. However strings are useful and fundamental to many programming tasks and the Java language provides some support for string manipulations via the language, the main items of which are the string concatenation operator (+) which we have seen in earlier examples and the notation for a string constant that has also been seen earlier. If you are familiar with C and similar languages you may be wondering whether a Java string is an array of characters with a special terminator. The answer is, basically, "mind your own business". How Java strings are actually stored is a matter for the Java Virtual Machine writer, not the application programmer. The java.lang.String class includes methods for extracting a particular character from a string and for converting a string into an array of characters. The histogram program (first version) Here is a simple program that uses an array. It reads a file of data consisting of numbers in the range 0-99 and produces a histogram showing how many numbers occur in each decimal range (0-9,10-99 etc.,). The data file name is specified as a command line argument. Without further ado here's the code. import java.io.*; public class Histogram { static public void main(String args[]) { int[] totals = new int[10]; try { BufferedReader inFile = new BufferedReader(new FileReader(args[0]) ); try { String line; while((line = inFile.readLine()) != null) { int value; value = Integer.parseInt(line); value /= 10; if(value < 0 || value > 9) System.err.println("Value " + value + " out of range"); else totals[value]++; } inFile.close(); } catch(IOException e) { System.err.println("IO exception"); System.exit(1); } catch(NumberFormatException e) { System.err.println("Value " + e.getMessage() + "not numeric"); } } catch(FileNotFoundException e) { System.err.println( "Couldn't open " + e.getMessage() ); System.exit(1); } // Now display the histogram for(int row=0;row<10;row++) { System.out.print(row + " "); for(int col=0;col 9) System.err.println( "Value " + value + " out of range" ); else totals[value]++; This is, actually, quite straightforward. The variable value is used to store the input value after conversion. There is an enclosing try/catch to deal with conversion errors. To determine which array element to increment, value is divided by 10 remembering that dividing an int by an int yields an int. Before incrementing (using the ++ operator) an array element, value, is checked to see whether it is in range. It would have been possible to omit this check and then incrementation might have thrown an ArrayIndexOutOfBoundsException exception which could have been caught. The code here involves far less coding effort. Once the file reading loop has been terminated by the readLine() method returning null meaning end of file, it is time to print out the histogram as a row of asterisks. Here's the code. // Now display the histogram for(int row=0;row<10;row++) { System.out.print(row + " "); for(int col=0;col 9) System.err.println("Value " + value + " out of range"); else totals[value]+= "*"; } inFile.close(); } catch(IOException e) { System.err.println("IO exception"); System.exit(1); } catch(NumberFormatException e) { System.err.println("Value " + e.getMessage() + "not numeric"); } } catch(FileNotFoundException e) { System.err.println( "Couldn't open " + e.getMessage() ); System.exit(1); } // Now display the histogram for(int row=0;row<10;row++) { System.out.println(row + " " + totals[row]); } } } The array totals is now an array of strings that is initialised explicitly by the code. String[] totals = new String[10]; for(int row=0;row<10;row++) { totals[row]=""; } The appropriate string has an extra asterisk appended as required using the code. totals[value]+= "*"; The simplified output loop should be obvious. Although this version is attractively simple, this is bought at a big price in terms of performance. The reasons are not immediately obvious but worth understanding as this sort of coding practice can have a very serious impact on the preformance of larger programs handling larger data sets. For the simple set of data used in these examples, the extra cost of running this version of the program will be almost completely hidden in start up overheads; for larger data sets this will not be so. In the jargon, the program scales badly. The inefficiencies are related to the execution of the code line totals[value]+= "*"; Java strings are immutable which means you can't change a string. The only thing you can do is delete a string and replace it with a new one which may be derived from the old one. This avoids a number of potentially nasty and rather subtle problems. However, each time the line of code shown above is executed, a string has to be deleted and created behind the scenes. The next example will show a more efficient approach. Another interesting point concerns the need to initialise the strings to "" (the empty string). It would seem that, unlike arrays of numeric types, arrays of strings are not initialised to anything useful or sensible by default. If the array initialisation is omitted and the program run with the same data set, the program output is: 0 null 1 null** 2 null**** 3 null 4 null***** 5 null 6 null 7 null* 8 null 9 null* This seems odd. References If you declare Java int variables such as x and y then the compiler will reserve memory locations of sufficient size (32 bits) to store the int. The actual data values will be stored in these locations. This is typical of nearly all programming languages. If you declare a String the situation is rather different. The variable associated with the name (s) holds the address of a memory location where the actual body of the string is stored. Any assignment to a string will cause the Java interpreter to find a new memory location to hold the string text and adjust the contents of s to hold the address of the new location. The memory that held the old contents is freed. All non-primitive data types in Java are handled in the same way as the strings in the example above. The variables are called references and the stored values that "point" to where the actual data is stored are called pointers. It is not possible to determine the actual values of pointers in Java; there is no equivalent of C's & address taking operator. The two main advantages of using references are firstly that it allows the size of data objects to change dynamically and secondly that it reduces the amount of data to be passed around when objects are being passed to methods. There is one minor exception to the rule that pointer values are inaccessible and that is that if the variable is uninitialised it won't point to a data area at all; it will hold the special value null. Any reference can be compared with null. It is also important to note that if you compare two references then you are comparing the pointers and not the things that are being pointed at. There will, usually, be special comparison methods associated with the data types and they will be defined as methods within the class definition for the data type. Here's an example that illustrates the point with string comparisons. public class Strcomp { static public void main(String args[]) { String[] s = new String[3]; s[1] = "London"; s[2] = "L"; s[2] = s[2] + "ondon"; if(s[1] == s[2]) System.out.println("Strings the same"); else System.out.println("Strings differ"); if(s[1].equals(s[2])) System.out.println("Strings the same"); else System.out.println("Strings differ"); if(s[0] == null) System.out.println("Uninitialised string value is " + s[0]); } } And here's the output. Strings differ Strings the same Uninitialised string value is null The first comparison compares the pointer values (addresses) of s[1] and s[2]. These are different even though, as the second comparison shows, the strings actually contain exactly the same text. The final line of output shows what happens if you attempt to access the value of an uninitialised pointer. The result is rather surprising; you might reasonably have expected some sort of exception. This is what happens if you modify the code of the second comparison to become: if(s[0].equals(s[2])) System.out.println("Strings the same"); else System.out.println("Strings differ"); With this modification the program output is: Strings differ Exception in thread "main" java.lang.NullPointerException at Strcomp.main(Strcomp.java:15) This is a typical Java error message. Intriguingly, although the founding fathers of Java decided to remove all references to pointers from the language, they forget about the error messages. Another interesting point arose when developing this program. The first version included the code. s[1] = "London"; s[2] = "London"; With this code the first comparison reported equal. What had happened was that the compiler was clever enough to spot that the same string ("London") had been used twice and decided that there was no need to make more than one copy in memory, so both s[1] and s[2] will point to the same place. This is safe because Java string values are immutable which means that they cannot be changed. You can make a reference point to a different string (as happens with the two assignments to s[2] in the program) by simple assignment, the new string being constructed from the old. Once you've done this sort of thing the Java interpretive system can now release (and ultimately reuse) the memory occupied by the old string; this is an example of garbage collection. You might wonder what effect such garbage collection would have if the program had included the code s[0] = "L"; The answer is that the garbage collector counts references and will only release space if there are no references pointing to the space to be released. The histogram program (third version) If you have followed the points in the previous discussion about the immutability of Java strings then you will appreciate that the inner operation of the programs shown above is rather clumsy. Each time an asterisk is appended to a string, a new string is created and the old one is garbage collected. You might have wondered whether it is possible to associate a suitably large "chunk" of memory with the string and just work within that chunk. Of course doing this with ordinary strings is not possible due to the immutability constraint. However, Java offers a class called the StringBuffer class that provides just such functionality. Here's the histogram program re-written using string buffers. Code modified from the previous version is highlighted. import java.io.*; public class Histogram3 { static public void main(String args[]) { StringBuffer[] totals = new StringBuffer[10]; for(int i=0;i<10;i++) { totals[i]=new StringBuffer(); } try { BufferedReader inFile = new BufferedReader(new FileReader(args[0]) ); try { String line; while((line = inFile.readLine()) != null) { int value; value = Integer.parseInt(line); value /= 10; if(value < 0 || value > 9) System.err.println("Value " + value + " out of range"); else totals[value].append('*'); } inFile.close(); } catch(IOException e) { System.err.println("IO exception"); System.exit(1); } catch(NumberFormatException e) { System.err.println("Value " + e.getMessage() + "not numeric"); } } catch(FileNotFoundException e) { System.err.println( "Couldn't open " + e.getMessage() ); System.exit(1); } // Now display the histogram for(int row=0;row<10;row++) { System.out.println(row + " " + totals[row]); } } } There are several interesting points about this piece of code. The first step is to create and initialise the array totals using this code: StringBuffer[] totals = new StringBuffer[10]; It is important to understand that this is an array suitable for holding references to string buffers. It is not an array of string buffers. The actual string buffers will be somewhere else. This, seemingly subtle, distinction allows the string buffers to be moved and resized without upsetting the array. Only the array references are changed. The second step is to initialise the elements of the array totals to hold references to actual string buffers. This is done using a loop. for(int i=0;i<10;i++) { totals[i] = new StringBuffer(); } The statement in the loop creates a new string buffer with a default capacity of 16 characters (32 bytes) and returns a reference to the newly created buffer. Unlike string concatenation the string buffer append() method behaves as you might expect when you use it to append to an otherwise "fresh" string buffer. The capacity of the string buffer will increase as necessary each time you append characters to it. Notice that you cannot use the string concatenation operator "+". This operator is only defined for string operands; a string buffer is not a string. Having said that you might wonder about the output statement: System.out.println(row + " " + totals[row]); which does seem to use a string buffer in a context where a string is expected. The explanation is that the expectation of a string forces the "hidden" use of the toString() method associated with the StringBuffer class. (In fact, every object has a toSting() method that returns a string representation of the object, allowing it to be printed). "hidden" means, in this case, that you could have written System.out.println(row + " " + totals[row].toString()); However, since the println() method requires a string, the compiler has worked out that the string buffer object's toString() method must be used and has done it for you anyway. The Histogram program (fourth version) An experienced programmer may well object that the programs shown so far are all rather inefficient. Surely it would be better to store the counts in each category and then construct and manipulate the display strings once at the end of the processing run. This is a fair comment. Here is a version of the program that works this way. import java.io.*; import java.util.Arrays; public class Histogram4 { static public void main(String args[]) { int totals[] = new int[10]; int max = 0; try { BufferedReader inFile = new BufferedReader(new FileReader(args[0]) ); try { String line; while((line = inFile.readLine()) != null) { int value; value = Integer.parseInt(line); value /= 10; if(value < 0 || value > 9) System.err.println("Value " + value + " out of range"); else totals[value]++; } inFile.close(); } catch(IOException e) { System.err.println("IO exception"); System.exit(1); } catch(NumberFormatException e) { System.err.println("Value " + e.getMessage() + "not numeric"); } } catch(FileNotFoundException e) { System.err.println( "Couldn't open " + e.getMessage() ); System.exit(1); } // Find the largest total for(int i=0;i<10;i++) { if(totals[i] > max) max = totals[i]; } // Now create an array char[] output = new char[max]; // On each row, fill the array with required '*' and print for(int row=0;row<10;row++) { Arrays.fill(output,' '); if(totals[row]>0) Arrays.fill(output,0,totals[row],'*'); System.out.println(row + " " + String.valueOf(output)); } } } There are a number of changes here and there is rather more code than in the earlier examples. First the class java.util.Arrays is imported. This class includes a method that allows the programmer to efficiently fill an array with multiple copies of a single character. The actual counting is significantly simpler. totals[value]++ is quite sufficient. The output strings are constructed using an array of characters filled with the appropriate number of '*' characters. It is, of course, first necessary to create a sufficiently large array of characters. Before that can be done it is necessary to determine the largest number of '*' characters that will be required in an output array. This is done using the loop: for(int i=0;i<10;i++) { if(totals[i] > max) max = totals[i]; } Next the array is created: char output[] = new char[max]; And finally, here is the output loop: for(int row=0;row<10;row++) { Arrays.fill(output,' '); if(totals[i]>0) Arrays.fill(output,0,totals[row],'*'); System.out.println(row + " " + String.valueOf(output)); } Note the use of the fill() method. Initially the array is filled with spaces (partly to overwrite whatever was written on the previous iteration of the loop) and then filled with the required number of asterisks. Note that the code has not created an object of type Arrays so the use of the Arrays class methods have to be qualified with the class name. (In the jargon, the class has no constructor method and cannot be instantiated). Finally the output conversion from an array of char is done by "hidden" use of the String class method valueOf equivalent to the following code: System.out.println(row + " " + String.valueOf(output)); And as a final final point, it is worth noting that the code here outputs rows of asterisks with trailing spaces. All the output lines are exactly the same length. This could be a problem with some forms of output device or if other programs were being used to process the output. The String class method trim() can be used to resolve this difficulty thus: System.out.println(row + " " + String.valueOf(output).trim()); This page is part of a set of notes about the Java programming language prepared by Peter Burden for the module CP4044. A disclaimer applies to this page.