Thursday, June 29, 2006

Algorithms for Java (Programmers)

I recently finished reading Robert Sedgewick's two volume book - Algorithms for Java and I must say that it is very informative and well-written. The book is in two volumes and five parts - the first volume, consisting of parts 1-4, deals with basic data structures such as List, Stack, Queue, Tree, Set (and all their variations, some of which I had read about earlier and some which were totally new) and various sorting and searching algorithms built using these structures. The second volume (part 5) deals with Graphs and algorithms for searching and traversing these Graphs.

But what does Java programming have to do with understanding data structures, you ask? We, who have our Data Structure implementations built for us in the form of the Java collection classes, and who (rightly or wrongly) believe that if a data structure cannot be implemented in terms of a Map or a List, then it is probably not worth implementing at all? Why should we have to learn about different sorting techniques when we already have Collections.sort(), which requires us only to provide it with elements that implement the Comparable interface? Besides, business application programming rarely needs data structures more complicated than Maps, Lists (and occasionally Sets) anyway.

I would argue that Java programmers tend not to use advanced data structures and algorithms because they dont know them. The reason that they dont know them is that they almost never need to think in terms of implementing data structures and they are too lazy (or pragmatic) to waste time learning something they would never need. The question is usually what structure to use, rather than how to implement and then use a structure. Business problems that would be most elegantly addressed by using one of these advanced data structures tend to get addressed with a one-off hack custom-built for the purpose. Not a bad thing in itself, since the business problem got solved. However we just invented a wheel which may have been invented multiple times, and there may be better wheels out there that we could have chosen had we just looked.

The Java collections classes are quite good, but they contain two glaring omissions - Trees and Graphs and their various implementations. A possible reason could be that there are too many variations to implement and support. Or it could be because, given their complexity, they are less likely to be used, and hence was not considered useful enough to include in the collection classes. However, this means that, when you have a business problem that requires a Tree or Graph, you are pretty much on your own. Generally, the approach I have taken in the past is to come up with a custom implementation that generally is not very reusable.

Although I am sold on the need for programmers to have a solid understanding of formal data structures and algorithms, I did not feel like I was getting what I was really looking for out of Professor Sedgewick's books. Dont get me wrong. I feel like I really learned a lot from these books, there were data structures I did not know about that I know of now as a result of the reading. There were algorithms that I just did not understand (or understand the need for) when I was in college, which I do now after reading these books. Based on my new found knowledge, I can see these being applied to various aspects of my work.

So what exactly am I looking for in a book of this sort? First, I am looking for applicability. A lot of the data structures and algorithms in the book are ones that a Java programmer will never use, or never have to use. A C or C++ programmer would probably find these data structures and algorithms more useful. The general feeling I get is that these books are a quick rehash of the C and C++ versions (by the same author), with the C/C++ code cut out and replaced with similar looking code in Java.

Not to say that one should not know about them, but the book could have been a lot smaller and easier to read if the "non-essential" (to Java programmers) data structures and algorithms were not covered in so much depth. It would have been helpful, on the other hand, if the author highlighted what algorithm is used within the Sun JDK, and in what situations one would not want to use that algorithm and prefer a custom one instead.

The next thing I am looking for is the general coding style. Notwithstanding the language similarities between C/C++ and Java, the objectives, programmer mindset and therefore the coding style of these languages are quite different. Java by design strives harder for readability than C/C++. The examples look like C/C++ code that have been modified to use Java syntax instead.

As for the specific style, I prefer the K&R style myself, but have used the Allman style before. This book uses the GNU style, which I dont think is very popular among Java programmers, so I found the code a little difficult to read. The K&R style is definitely the most popular style used in Java today, as evidenced by the default style in the Eclipse IDE, and just as compact in terms of vertical whitespace as the GNU style, so it may help if the code was presented in that manner. This blog entry describes some popular coding styles in C/C++/Java like languages.

Another code style issue is the cryptic C-like variable naming in the code examples. Having gotten used to Spring's descriptive and rather long variable and class names, I find long variable names easier to map in my head when reading through a program.

But if Java programmers are too lazy (or pragmatic) to learn what they would never use, what happens when they are faced with implementing a structure or algorithm that is not available in pre-built form? Obviously books such as these, with its wealth of working code examples, can help. However, what would help even more, would be if they were implemented into real working ADTs (Abstract Data Types) and perhaps contributed to some place like the Apache Commons Collections package. That way, the book could become even more readable and concentrate on only the usage scenarios where a particular ADT implementation should be used and why.

I realize that much of this is heretical in a sense, and I apologize if I have offended the data structure and algorithm purists among you. My objective in this article was to highlight how excellent books such as this can be made even more useful for professional Java programmers. Java programmers generally dont care about the how to implement data structures as much as C/C++ programmers do, having been spoilt by the availability of built-in ADTs in the base JDK, they are more concerned about the question of which one to use in what situation.

9 comments (moderated to prevent spam):

Harith Elrufaie said...

Thanks for the interesting analysis! I feel like I want to read more about java data structures! And you're right that java programmers do not usually tend to read about data structure unless they're faced with application logic that requires different collection than list or a map! lol

Keep it up! I added your blog to my google homepage so i'll be watching all your updates!

Sujit Pal said...

Thanks for the comment. Revisiting data structures once one has some practical programming experience has proved to be quite useful for me. If nothing else, I now realize the rationale of why we were taught graph theory at college.

Unknown said...

Sujit - I have blogged on an interesting data structure that is coming on in Mustang. Have a look.

Unknown said...

hay cn u fwd me pdf of this book..
lovingrakesh@gmail.com

thnx dude :)

Sujit Pal said...

Sorry, I have the print versions of this book. I don't have the PDFs.

Unknown said...

Sujit, I am a little bit new on java, but I try to do most advanced things (whole genome sequencing), I am so willing to do something on this area, your blog is so useful, thank you, and please continue share something us :)

Sibel ( I also send request you from linkedin)
sibel.sisadmin@gmail.com

Sujit Pal said...

Cool! I have become interested recently in the genome because of my kids, and just read Genome by Matt Ridley for some background knowledge. Any pointers you can share on your work on genome sequencing?

Unknown said...

Sujit;

Actually, I am also new on bioinformatic, but I will focus on DNA sequencing by using whole genome data, DNa searching..But firstly I must learn java algorithm and also mahout :(

Sujit Pal said...

Cool, sounds very interesting!