Friday, June 27, 2014

DataBasin: more powerful CSV writing for Select and Select-Identify

DataBasin, the clean-room impelmentation of a data extraction tool for SalesForce.com available for GNUstep and Mac just made a big leap forward!

If you have ever done a SELECT in salesforce.com, you might have noticed that the results are not ordered and that semi-joined fields ("." notation) are handled strange because a whole object is returned.

DataBasin extracts the  columns of the CSV file by checking the first row of the dataset and recursing on the field names of salesforce's response. The effects in DataBasin are the following:
  • The field order is not preserved
  • If a sub-object is queried, all those fields are grouped together in that object
  • If more than one field is queried on a sub-object and on certain record this object is missing, the number of columns between records is inconsistent: salesforce.com doesn't return the whole object and DataBasin iterates record after record in the sub-object, but if it it is totally missing, only one empty column will be written, not as many as needed
  • Capitalization of the column names doesn't get preserved (compared to the above issues, something really minor)

I added a new option available in write out fields in the order of the query.

This feature implies a more complex under the hood. For each record gotten in the response, the field names are reconstructed as before (e.g. by recursing inside sub-objects) and put in a Dictionary.

The original SOQL query is parsed and the field names extracted, then these names are used as keys for each record dictionary and written out.
  • Order preserved, also with sub-objects
  • Fields preserved and empty values are written even if whole sub-objects are missing
  • Case is preserved
This approach looks fine and comes with a relatively small performance penalty, however it is delicate because during parsing of the query, many naming subtleties of Salesforce.com needs to be inferred, that is, DataBasin tries to infer how the name in the Response will be:
  • field aliases
  • aggregate names with no alias, which are called progressively Expr0, Expr1
  • idiosyncrasies with aggregate functions. A field in a sub-objects is called Object__c.Field__c when in a regular query, but when used in an aggregate query, only Field__c is used, disregarding the join-depth
  • COUNT() vs.COUNT(Id)
I hope I have not missed any! in any case, enjoy!



This feature is available now both in Select and Select-Identify, since it is a problem of the CSV Write backend and enhances DataBasin operation usability greatly, bringing it on par with DataLoader.
Since DataLoader performces similar transformation and sometimes produces wrong results, I left this feature as optional with the flag "Write fields in query order".