Performance issues

Apr 3, 2010 at 4:05 PM

Hi everybody,

I'm trying to utilize this library for a project, especially for the data processing part.

I need to process about 23,000 files per run, each of them a gzipped JSON file. So I slurp every file, decompress it, deserialize it into a List and then process the elements (ending by writing some of them to the output file). With PHP, the whole process took about 2 seconds per file (increasing over the course of the run), which was hardly acceptable, so I switched to C#. Everything runs much faster, BUT: deserializing a structure of about 20k elements into a List still takes about 0.7 seconds (excluding unzipping and further processing, which come on top). Any ideas on how to improve this? The underlying data structure is rather boring: 6 string elements, nothing really exciting.


Thanks in advance!


Apr 6, 2010 at 8:44 PM

Dmitri, try using JsonReader to process your file sequentially, record by record; this way you'll avoid deserializing everything at once. Apart from that, you can chain the GZip stream reader (GZipStream in the System.IO.Compression namespace) with the JsonReader and do file decompression and JSON processing in a single pass.


Apr 6, 2010 at 9:11 PM

Can you give me a code example?


Assuming GZStream is the - already opened - stream and I wanted to process the records (...of type Item), how would I do it?



Apr 7, 2010 at 8:00 AM

Use JsonTextReader: call Read() on it and then act on the TokenType and Value as necessary. It works in exactly the same fashion as XmlTextReader.
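
A minimal sketch of that token loop, chained onto a GZipStream so decompression and parsing happen in one pass. The class name TokenDump and the path parameter are made up for illustration; the stream and reader types are the standard .NET and Json.NET ones.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using Newtonsoft.Json;

static class TokenDump
{
    // Prints every JSON token of a gzipped file without loading the file whole.
    // "path" is a hypothetical input file name supplied by the caller.
    public static void Dump(string path)
    {
        using (var file = File.OpenRead(path))
        using (var gzip = new GZipStream(file, CompressionMode.Decompress))
        using (var text = new StreamReader(gzip))
        using (var reader = new JsonTextReader(text))
        {
            // Read() advances one token at a time and returns false at end of input.
            while (reader.Read())
            {
                Console.WriteLine("{0}: {1}", reader.TokenType, reader.Value);
            }
        }
    }
}
```

Value is null for structural tokens such as StartObject and EndArray, so those lines show only the token type.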

Apr 7, 2010 at 8:50 AM

I tried it yesterday... It's far too low-level an approach. Read() gets me one atomic token at a time, like a String or an Int or whatever, leaving me to assemble the tokens into the data structure myself. I would prefer something like:


Item e;
while ((e = reader.getNext()) != null) {
  // do something with e
}

instead of fiddling with pieces of "e". For now, I'm just translating JSON to CSV while receiving the data (using PHP), file by file, and then I can just string.Split() each line to get my records.
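
A record-level loop of this kind can probably be built on top of JsonTextReader by handing the reader to a JsonSerializer each time an object starts. The following is only a sketch under some assumptions: the file holds one top-level JSON array of objects, the Item class and its six property names are invented for illustration, and the generic Deserialize<T>(JsonReader) overload is available (older releases may need Deserialize(reader, typeof(Item)) with a cast instead).

```csharp
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;

// Hypothetical record type: six string fields, names invented for illustration.
public class Item
{
    public string F1 { get; set; }
    public string F2 { get; set; }
    public string F3 { get; set; }
    public string F4 { get; set; }
    public string F5 { get; set; }
    public string F6 { get; set; }
}

public static class ItemStream
{
    // jsonStream is assumed to be an already opened stream that yields JSON text,
    // for example a GZipStream created with CompressionMode.Decompress.
    public static IEnumerable<Item> ReadItems(Stream jsonStream)
    {
        var serializer = new JsonSerializer();
        using (var text = new StreamReader(jsonStream))
        using (var reader = new JsonTextReader(text))
        {
            while (reader.Read())
            {
                // Each StartObject inside the top-level array begins one record.
                // Deserialize consumes tokens up to the matching EndObject.
                if (reader.TokenType == JsonToken.StartObject)
                {
                    yield return serializer.Deserialize<Item>(reader);
                }
            }
        }
    }
}
```

Usage would then look like: foreach (Item e in ItemStream.ReadItems(GZStream)) { /* do something with e */ }. Only one record is materialized at a time, so memory use stays flat regardless of how many elements the file holds.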


Apr 17, 2010 at 8:20 AM