Friday, December 2, 2011

Sets (Java Collections)

Using collections in Java: Sets

Sets are used to test whether an item is "present or not" in the collection, but without allowing duplicates and generally without keeping items in a fixed order, unlike lists.

A List such as an ArrayList is useful in cases where we want to store some arbitrary large group of objects in a fixed order, and possibly refer to those objects by their position in the list. But a list is not so good for another type of problem where:

we don't necessarily care about ordering;
we don't care about the number of times that an object is in the collection;
we just want to know if a given object is in the collection or not.
A collection in which an object may be "present or not" (but not present multiple times) is called a set.

Why use a set?

It may have occurred to you that you could use a plain old list for the above purpose. After all, List has a contains() method to see if a given object is in the list, and to preserve the present-or-not condition, we could just check before adding if the given object was already in the list, and if so not add it again. So what's the point of a "set"?

The point is simply efficiency:

If your only necessary criterion is present or not and you don't need the extra functionality of a list (such as ordering, being able to retrieve the nth item), then it is possible to use a more efficient data structure. That's generally what Java's set implementations do.


How to use a Java set: the HashSet

Above, we said that a set was a collection in which an object can be present or not.

Constructing a Set

As with Java collections in general, Set is an interface. When we actually construct a set, we need to pick a specific type.

The usual type of set to choose is the HashSet. The name HashSet comes from the fact that the class uses a technique called hashing to quickly add, find and remove items from the set. As with lists, we can use Java 5's generics feature to specific that we want a set of a particular type of object. So here, we'll create a set of strings:


Set urlsProcessed = new HashSet(500);


We also pass in an estimate of the number of elements that we want to add to the set. We didn't bother with the list (but could and arguably should have done). Giving a number of elements is not essential for collections to work, but can often make them more efficient. It's important to note that this number is just an estimate for the sake of efficiency. It's not an absolute limit, as with an array. We can add more items than our initial estimate.

Adding items to the set...

Now that we have our set, whenever we want to add an item to the set (a string in this case), we call its add() method:


String url = "http://www.javamex.com/";
urlsProcessed.add(url);


...and finding them again

Whenever we want to find out if a particular item is in the set, we use the contains() method:


if (urlsProcessed.contains(url)) {
  // ...
}


And as you might have guessed, whenever we want to remove an item, we can use the remove() method:


urlsProcessed.remove(url);


Common uses for a set

Here, we call our set urlsProcessed. A common use for a set is to remember which items in a list have 'already been seen' so that they are not processed more than once. Imagine a program that spiders web pages by pulling the list of links out of each page it comes across, then going through the linked pages in turn: we need to only process a URL if it hasn't already been processed, otherwise we'd risk going round in an infinitive loop. With our set, we can write something like this to ensure each URL gets processed a maximum of once:


public void processURLs(List urls) {
  for (String url : urls) {
    if (!urlsProcessed.contains(url)) {
      process(url);
      urlsProcessed.add(url);
    }
  }
}


What if I add the same item more than once?

Recall that the point of a set is that it only allows a given item to be added once. If you try and add the same item agan, the add() method will effectively ignore the request. But its return value will tell you if the item was actually added (because it wasn't there before). In the next section, we'll see that we can use this return value to our advantage in certain cases.

A slight performance improvement

The previous example will work fine in many cases. However, it is slightly suboptimal in terms of performance. To check if an item is in the set, the contains() method must make a calculation (specifically, it must calculate something called the hash code). Then, the add() method must also calculate the same hash code to decide where to place the item. Ideally, we'd just like to calculate the hash code once.

Well, we said that we can't add an item to the set more than once, and that the add() method tells us whether or not the item was actually added because it wasn't already there. So if we don't mind adding the URL to the set just as we're going to process it, then we can do the following:

add the item before we decide whether to process it;
look at the return value from add():
if it returns true, that means that the item was added (because it wasn't already in the set), and so we do need to process the URL;
if it returns false, then the item wasn't added this time because it was already there— that obviously means we don't need to process it.
So the code looks something as follows:


public void processURLs(List urls) {
  for (String url : urls) {
    if (urlsProcessed.add(url)) {
      process(url);
    }
  }
}


Enumerating the contents of a set

As well as checking for individual items, we can iterate over all the items in a set with syntax very similar to that of iterating over a list:


System.out.println("URLS processed:");
for (String url : urlsProcessed) {
  System.out.println(url);
}


Recall that a set (and particularly a HashSet) is generally unordered. In other words, when you iterate over the contents of a set as in this example, you won't necessarily get the items out in the same order that you put them in (or in any other well-defined order).

Next...

On the next post, we look at a third type of Java collection, namely the map. Maps are used in all sorts of programming situations where we want to create associations between objects.

No comments:

Post a Comment