Wednesday, February 25, 2015

Parsing boring old CSV files with Java

It sometimes seems as if Java (JEE) skipped specifying a robust mechanism for handling CSV files. For instance, there is nothing like a schema description (XSD or DTD), which exists for XML files. Perhaps it was just too trivial, but I've found that despite the existence of more "robust" formats like XML and JSON, a lot of CSV files are still in use, which probably has much to do with the ubiquity of Excel. Whatever the reason, it would be nice to have a schema description for a CSV file, which could serve as a sort of interface description to bridge the gap between developers and business analysts.

I discovered something from Digital Preservation, which has created a very nice-looking schema language for the CSV format; unfortunately, it was not well suited for Java. The library is written in Scala and comes with a Java bridge, but at the time of writing this blog, the bridge didn't allow one to get much information about the exact errors and their positions, because the result of the parsing was returned as one long, loosely formatted string.

There are other libraries, like Jackson CSV and Super CSV from sourceforge.net, which seem less sophisticated or ambitious than Digital Preservation's; however, I wasn't impressed by them, because the problem of bridging the gap between technical and business people doesn't seem to have been addressed. The CSV structure is largely built up in Java code at run time and is almost unintelligible to a non-Java person.

So, for my last task, which involved parsing three types of CSV files, I resorted to using the Apache Commons CSV parser and Java annotations. What I did was to create a Pojo with all the columns contained in the CSV. Then I invented some simple annotations with which I decorated the Pojo, and which the business analyst could read and understand without much difficulty.

Below is a small example of one of these Pojos. You might laugh that I passed around a Java source file as a CSV file description, but it worked well, and people didn't seem to have problems finding what was relevant for them and understanding it. At first it might look unintelligible; with a little patience, however, one can read through it.

Each Java class defines a CSV line, here one with 4 columns; in reality it was up to 396 columns. The order of the columns in a line is described by the @CsvColumnOrder annotation (with Java reflection it is not possible to determine the order in which fields are declared, hence the extra annotation). The expected type of each column is described within the class definition as a class attribute with further annotations like @CsvChecked, @CsvIsinCode, etc.


DbtPojo.java 
@CsvColumnOrder({
"IsActive",
"ExternalCode_ISIN",
"AmountIssued",
"PaymentDate"
})

public class DbtPojo {

  @CsvChecked(regex="^[YN]{1}", required=true)
  public String IsActive;

  @CsvIsinCode
  @CsvUnique
  public String ExternalCode_ISIN;

  @CsvNumber(decimal = true)
  public String AmountIssued;

  @CsvDate
  public String PaymentDate;


}
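The annotations themselves are not shown above, but they are plain Java annotations retained at run time. A minimal sketch of how two of them might be declared (the annotation names come from the example; the attribute bodies are my assumption), together with a tiny annotated class to show the metadata being read back:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

public class CsvAnnotationsSketch {

    // Class-level annotation carrying the column order, since reflection
    // does not guarantee the declaration order of fields.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface CsvColumnOrder {
        String[] value();
    }

    // Field-level regex check for a single column value.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface CsvChecked {
        String regex();
        boolean required() default false;
    }

    // A small annotated Pojo, purely for demonstration.
    @CsvColumnOrder({"IsActive", "PaymentDate"})
    public static class MiniPojo {
        @CsvChecked(regex = "^[YN]$", required = true)
        public String IsActive;
        public String PaymentDate;
    }

    public static void main(String[] args) {
        // Read the declared column order back via reflection.
        CsvColumnOrder order = MiniPojo.class.getAnnotation(CsvColumnOrder.class);
        System.out.println(String.join(",", order.value())); // IsActive,PaymentDate
    }
}
```

The key detail is RetentionPolicy.RUNTIME: without it, the annotations would be discarded by the compiler or class loader and getAnnotation() would return null during parsing.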



Using reflection, it is possible during parsing to get information about the relevant Pojo fields by looking for the presence of annotations. For instance, if the second column of a line has been returned by the Apache parser, then from the CsvColumnOrder annotation one knows which field of the Pojo needs to be checked, namely "ExternalCode_ISIN". With a reference to the Java Field, one can check for the presence of certain annotations by calling getAnnotation(Class). If the annotation is present, here CsvIsinCode, one can react appropriately for an ISIN code.
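That lookup can be sketched as follows. The annotation names come from the post; the validate method and its regex-only logic are my simplification of what the real parser would do:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

public class CsvReflectionSketch {

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface CsvColumnOrder { String[] value(); }

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface CsvChecked { String regex(); }

    @CsvColumnOrder({"IsActive", "ExternalCode_ISIN"})
    public static class MiniPojo {
        @CsvChecked(regex = "^[YN]$")
        public String IsActive;
        public String ExternalCode_ISIN;
    }

    // Validate one column value: map the column index to a field name via
    // @CsvColumnOrder, look up the Field, then react to any annotation found.
    public static boolean validate(Class<?> pojo, int columnIndex, String value) {
        String fieldName = pojo.getAnnotation(CsvColumnOrder.class).value()[columnIndex];
        Field field;
        try {
            field = pojo.getField(fieldName);
        } catch (NoSuchFieldException e) {
            throw new IllegalStateException("No field for column " + fieldName, e);
        }
        CsvChecked checked = field.getAnnotation(CsvChecked.class);
        if (checked != null && !value.matches(checked.regex())) {
            return false; // value violates the declared pattern
        }
        return true; // no check declared, or the check passed
    }

    public static void main(String[] args) {
        System.out.println(validate(MiniPojo.class, 0, "Y")); // true
        System.out.println(validate(MiniPojo.class, 0, "X")); // false
    }
}
```

Note that the Pojo class is only ever passed around as a Class object; no instance is created, which matches how the real parser used these classes purely as metadata.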

I've included the code as a zip for those who are interested in more detail.

The reader should not be misled into thinking that the CSV values from the files are parsed into instances of the Pojos described above. In fact, the classes are never even instantiated; they are used only for their annotations and field definitions, that is, for metadata purposes.
