Translations and conversions

Many applications in FACE-IT need to translate from one data type to another. Sometimes this is a file format conversion, such as switching from comma-separated to space-separated values or from GRIB to NetCDF. Other times it involves a more complicated rearrangement of the data. Sometimes it just means changing the units of a value, for example from Kelvin to Celsius.

Sometimes it might be all three!
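
As a rough illustration of two of these cases combined, the sketch below switches comma-separated text to space-separated text and converts one column from Kelvin to Celsius. The function names and the sample column name `temp` are hypothetical and chosen only for this example; they are not part of FACE-IT or Galaxy.

```python
import csv
import io


def kelvin_to_celsius(kelvin):
    """Convert a temperature from Kelvin to degrees Celsius."""
    return kelvin - 273.15


def translate(csv_text, temperature_column):
    """Rewrite comma-separated text as space-separated text and convert
    the named column (a hypothetical choice) from Kelvin to Celsius."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter=" ", lineterminator="\n")
    # Copy the header row unchanged.
    writer.writerow(reader.fieldnames)
    for row in reader:
        celsius = kelvin_to_celsius(float(row[temperature_column]))
        row[temperature_column] = f"{celsius:.2f}"
        writer.writerow([row[name] for name in reader.fieldnames])
    return out.getvalue()


if __name__ == "__main__":
    print(translate("station,temp\nA,273.15\nB,300.00\n", "temp"))
```

A real translator would also have to deal with missing values, quoting, and metadata, all of which this sketch ignores.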

First, a few definitions:
  • Sniffer -- A software component that guesses the Data Type of a given file. The sniffer is implemented as the method ‘sniff’ of the Data Type class. The sniff method returns true if the file matches a defined set of criteria and false otherwise.
  • Translation -- A data format conversion from a source Data Type to a destination Data Type independent of the complexity of the conversion itself.
  • Converter -- A special Galaxy tool characterized by two parameters: the source Data Type and the destination Data Type. Converters are installed within a given tool, enabling it to recognize different Data Types and convert the files into the format the tool requires. The converter's XML description is very similar to that of a regular tool.
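
To make the sniffer idea concrete, here is a minimal standalone sketch of a data type with a ‘sniff’ method. The class `CommaSeparated`, its matching criteria, and the helper `guess_data_type` are assumptions made for illustration; the class does not inherit from Galaxy's real Data Type base class and is not meant to reproduce Galaxy's own sniffing rules.

```python
class CommaSeparated:
    """Hypothetical data type used only for this example."""

    file_ext = "csv"

    def sniff(self, filename):
        """Return True if the first few non-empty lines all contain the
        same number (more than one) of comma-separated fields."""
        counts = []
        with open(filename, "r", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                counts.append(len(line.split(",")))
                if len(counts) >= 5:
                    break
        return bool(counts) and counts[0] > 1 and len(set(counts)) == 1


def guess_data_type(filename, data_types):
    """Return the extension of the first data type whose sniffer matches,
    or None if no sniffer recognizes the file."""
    for data_type in data_types:
        if data_type.sniff(filename):
            return data_type.file_ext
    return None
```

The order in which sniffers are tried matters in a scheme like this: a permissive sniffer placed early can claim files that a stricter one further down would have matched.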

Translator apps can be implemented in one of two ways, depending on where they are needed and what they need to do. Complicated conversions that fundamentally rearrange the data and change the format are probably best developed as standalone tools in the toolshed: one data type comes in and the other comes out. See this link [insert] for a detailed example of someone developing a simple translator tool. Simpler conversions can be done with a converter.

So how do you decide which approach is right for a given case? 

Usually converters are executed by the local scheduler, sharing computing resources with the machine that runs the Galaxy GUI. This should be avoided if the conversion is computationally intensive, as it will slow down other functionality. Converters can be convenient in the workflow canvas, though, because they allow Galaxy to plan the conversion chain automatically: if a tool producing data of Data Type ‘A’ can "connect" (i.e. Galaxy will let you drag the pipe) to a tool consuming data of Data Type ‘D’, then Galaxy can automatically find a conversion chain from A to D, even one with intermediate conversion steps through the Data Types C and B.
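
The chain planning described above can be pictured as a path search over a graph whose nodes are Data Types and whose edges are installed converters. The sketch below uses a breadth-first search to find a shortest chain. This is an assumption about how such planning could work, not a description of Galaxy's actual implementation, and the `converters` mapping is a hypothetical stand-in for the installed converters.

```python
from collections import deque


def find_conversion_chain(converters, source, destination):
    """Return a list of Data Types leading from source to destination,
    or None if no chain of converters connects them.

    converters maps a source Data Type to the Data Types it can be
    converted to directly (hypothetical structure for this example).
    """
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == destination:
            return path
        for next_type in converters.get(path[-1], []):
            if next_type not in seen:
                seen.add(next_type)
                queue.append(path + [next_type])
    return None


if __name__ == "__main__":
    # A converts to C, C to B, and B to D, as in the example above.
    converters = {"A": ["C"], "C": ["B"], "B": ["D"]}
    print(find_conversion_chain(converters, "A", "D"))
```

Each hop in the returned chain is a separate conversion job that reads and writes a full copy of the data, which is why long chains can be expensive even when every individual step is cheap.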

In conclusion, if the conversion's computing time is negligible, Galaxy converters are the best choice because the conversion chain is managed automatically. Otherwise, if the conversion is computationally intensive, a tool executed remotely is the best choice. One caveat is especially important: take the cost of data movement into account when complex, automatically managed conversion chains are used, because it can become a performance pitfall.