Performance - large documents

Feb 14, 2011 at 6:12 AM

When using Json.NET as part of RavenDB, I managed to hit a performance bottleneck when reading large documents.

The test document in question is 2.9 MB as a JSON string, 1.56 MB as BSON binary and 2.88 MB as an XML string. Those are all the exact same document, just in different encodings.

I tested reading those documents in each format, and I got the following results:

  • Reading JSON: 248 ms
  • Reading BSON: 258 ms
  • Cloning in memory: 114 ms
  • Reading XML: 43 ms

Note that there are several surprising things here:

  • Reading JSON is comparable to or faster than reading BSON; I would have expected it to be the other way around.
  • Doing an in-memory clone (new JObject(anotherJObject);) when the document is large is very expensive.
  • Reading the same document as XML is drastically cheaper.

The sample application, along with the test files, is available here:
http://dl.dropbox.com/u/6603892/ConsoleApplication1.zip
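
For reference, the read being timed for each format boils down to roughly the following (a minimal sketch, not the actual code from the sample app; the file names are placeholders):

    // A minimal sketch of the kind of read being timed. File names are placeholders.
    // Requires: using System.IO; using System.Xml.Linq;
    //           using Newtonsoft.Json; using Newtonsoft.Json.Bson; using Newtonsoft.Json.Linq;

    // JSON: build the full JObject tree from a text reader.
    JObject fromJson;
    using (var reader = new JsonTextReader(new StreamReader("document.json")))
    {
        fromJson = JObject.Load(reader);
    }

    // BSON: build the same tree from the binary representation.
    JObject fromBson;
    using (var reader = new BsonReader(File.OpenRead("document.bson")))
    {
        fromBson = JObject.Load(reader);
    }

    // XML: the equivalent XDocument load, which turned out to be much cheaper.
    XDocument fromXml = XDocument.Load("document.xml");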

Initial profiling suggests that at least some of this is due to lazy evaluation in the JContainer.ChildrenInternal implementation.

This was tested on Json.NET 4.0 R1

Coordinator
Feb 14, 2011 at 7:55 PM
Edited Feb 14, 2011 at 8:50 PM

There are improvements to be made when cloning a LINQ to JSON object by turning off validation and events on the new object when it is being populated.

For example, each time a new property is set on a JObject, a check is made to make sure that property doesn't already exist, which I'm sure is a big bottleneck when cloning JObjects with a lot of properties. I have been considering an internal flag that is set when cloning starts to disable that overhead.
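
Roughly, the idea would look something like this (a heavily simplified sketch of a hypothetical container, not Json.NET's actual internals):

    // Hypothetical, simplified container - not Json.NET's actual classes.
    // Requires: using System; using System.Collections.Generic; using System.Linq;
    class SimpleObject
    {
        private readonly List<KeyValuePair<string, object>> _properties =
            new List<KeyValuePair<string, object>>();
        private bool _cloning; // internal flag set while a clone is being populated

        public void Add(string name, object value)
        {
            // Normally every add scans the existing properties for a duplicate name;
            // on an object with thousands of properties that makes a full clone O(n^2).
            if (!_cloning && _properties.Any(p => p.Key == name))
                throw new ArgumentException("Duplicate property: " + name);

            _properties.Add(new KeyValuePair<string, object>(name, value));
        }

        public SimpleObject Clone()
        {
            // The source object is already known to be valid, so the duplicate
            // check (and any change events) can safely be skipped while copying.
            var clone = new SimpleObject { _cloning = true };
            foreach (var pair in _properties)
                clone.Add(pair.Key, pair.Value);
            clone._cloning = false;
            return clone;
        }
    }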

The BSON and JSON readers are written the fastest way I know how - optimizing code to extremes isn't something I have a lot of experience with. How MS gets such good performance out of their XML reader is a mystery to me.

Both readers are pretty self-contained, so if you come up with a way to do it quicker that still passes all the unit tests, and the code is still somewhat sane, I'm happy to accept patches.

Feb 15, 2011 at 9:11 AM

I played around with some stuff, and it seems that for the most part the actual cost isn't in the reading itself; it is in constructing the large JObject.

I think a major cost here is the internal structure of all the objects: they are essentially linked lists, and a lot of operations require you to traverse the entire chain.

For example, adding an item to a deeply nested token requires traversing the whole chain to verify the parent token, and adding a single property requires traversing all existing properties.

Looking at the System.Json implementation, it seems that they are using the standard collections for storing the information.
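
To make the cost concrete, here is a rough illustration (purely illustrative, not Json.NET or System.Json code) of linear duplicate scans versus a dictionary lookup when adding many properties:

    // Purely illustrative micro-benchmark.
    // Requires: using System; using System.Collections.Generic; using System.Diagnostics;
    const int n = 20000;

    // Linear-scan style: every add walks all existing entries (O(n^2) overall).
    var names = new List<string>();
    var sw = Stopwatch.StartNew();
    for (int i = 0; i < n; i++)
    {
        string name = "prop" + i;
        if (!names.Contains(name))
            names.Add(name);
    }
    Console.WriteLine("linear scan: " + sw.ElapsedMilliseconds + " ms");

    // Hash-based style, as the standard collections provide (O(n) overall).
    var lookup = new Dictionary<string, int>();
    sw.Restart();
    for (int i = 0; i < n; i++)
    {
        string name = "prop" + i;
        if (!lookup.ContainsKey(name))
            lookup.Add(name, i);
    }
    Console.WriteLine("dictionary: " + sw.ElapsedMilliseconds + " ms");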

Is there a very important reason to implement Json.NET in this fashion? Can this be changed?

Coordinator
Feb 16, 2011 at 6:28 AM

I did it that way because I wanted to maintain order of nodes and it was how LINQ to XML was written, which I largely emulated.

You're welcome to rewrite its internals if you'd like - it has pretty good unit test coverage so it should be easy to spot anything that breaks. Before you start anything though, I think the performance of what is there now is acceptable for the average case, so I'd only accept a patch that doesn't break existing code using Json.NET.

Apr 14, 2011 at 8:27 AM

Okay, I looked into trying to patch this, but it seems like it would be too complicated a job while maintaining backward compatibility.

What I did instead was create my own set of DOM classes for JSON and write the adapters so they can use JsonReader, JsonWriter and friends.

The overall result is about a twofold increase in performance.
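
To give an idea of the shape of the approach (a hedged sketch only; the real classes are in the repository linked below and differ in detail):

    // Hypothetical dictionary-backed DOM node - illustrative only, not the actual Raven.Json classes.
    // Requires: using System.Collections.Generic; using Newtonsoft.Json;
    class RavenLikeObject
    {
        private readonly Dictionary<string, object> _values = new Dictionary<string, object>();

        public object this[string key]
        {
            get { return _values[key]; }
            set { _values[key] = value; }
        }

        // Adapter: populate the node from any JsonReader (text, BSON, ...).
        public static RavenLikeObject Load(JsonReader reader)
        {
            var obj = new RavenLikeObject();
            while (reader.Read() && reader.TokenType != JsonToken.EndObject)
            {
                if (reader.TokenType != JsonToken.PropertyName)
                    continue;
                string name = (string)reader.Value;
                reader.Read();
                obj._values[name] = reader.Value; // nested objects omitted for brevity
            }
            return obj;
        }

        // Adapter: write the node back out through any JsonWriter.
        public void WriteTo(JsonWriter writer)
        {
            writer.WriteStartObject();
            foreach (var pair in _values)
            {
                writer.WritePropertyName(pair.Key);
                writer.WriteValue(pair.Value);
            }
            writer.WriteEndObject();
        }
    }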

https://github.com/ayende/ravendb/tree/e65fecb17e22151da25acf32a00f8da6fc82c5f4/Raven.Json

May 22, 2011 at 12:46 AM

I too am experiencing a performance issue, even with smaller documents. XDocument seems to be 2x as fast as Json.NET when extracting raw data without doing class mappings.

A very simple example is below. The XML is deserialized about 2x as fast. I find this very hard to believe; there has to be something happening in the JSON code that is incorrect.



    // Using directives needed for this snippet:
    // using System;
    // using System.Diagnostics;
    // using System.Xml.Linq;
    // using Newtonsoft.Json.Linq;

    const long totalIterations = 1000000;

    const String xml =
        @"<?xml  version=""1.0"" encoding=""ISO-8859-1""?>
            <root>
                <property name=""Property1"">1</property>
                <property name=""Property2"">2</property>
                <property name=""Property3"">3</property>
                <property name=""Property4"">4</property>
                <property name=""Property5"">5</property>
            </root>";

    const String json =
        @"{
            ""Property1"":""1"",
            ""Property2"":""2"",
            ""Property3"":""3"",
            ""Property4"":""4"",
            ""Property5"":""5""
        }";

    // Parse the JSON document and read out each property value.
    var watch = new Stopwatch();
    watch.Start();
    for (long iteration = 0; iteration < totalIterations; ++iteration)
    {
        var obj = JObject.Parse(json);
        obj["Property1"].Value<Int32>();
        obj["Property2"].Value<Int32>();
        obj["Property3"].Value<Int32>();
        obj["Property4"].Value<Int32>();
        obj["Property5"].Value<Int32>();
    }
    watch.Stop();
    // Iterations per second (integer division, so the result is rounded down).
    var performance1 = (totalIterations / watch.ElapsedMilliseconds) * 1000;
    Console.WriteLine("JSON: " + performance1);

    // Parse the equivalent XML document and read out the same values.
    watch.Reset();
    watch.Start();
    for (long iteration = 0; iteration < totalIterations; ++iteration)
    {
        var doc = XDocument.Parse(xml);
        var alarmProperties = doc.Descendants("property");
        foreach (var property in alarmProperties)
        {
            var attr = property.Attribute("name");
            var name = attr.Value;
            switch (name)
            {
                case "Property1": Int32.Parse(property.Value); break;
                case "Property2": Int32.Parse(property.Value); break;
                case "Property3": Int32.Parse(property.Value); break;
                case "Property4": Int32.Parse(property.Value); break;
                case "Property5": Int32.Parse(property.Value); break;
            }
        }
    }
    watch.Stop();
    var performance2 = (totalIterations / watch.ElapsedMilliseconds) * 1000;
    Console.WriteLine("XML: " + performance2);

Jun 8, 2011 at 2:25 AM

I too thought that was kind of strange. Running AFinnell's example above I got the following results:

JSON: 23977 ms (41000 iterations/sec)
XML: 15297 ms (65000 iterations/sec)

Figured I'd try changing it to a dictionary instead, like so:

// Requires: using System.Collections.Generic; using Newtonsoft.Json;
// This replaces the JObject.Parse block inside the same benchmark loop.
Dictionary<string, string> values = JsonConvert.DeserializeObject<Dictionary<string, string>>(json);

Int32.Parse(values["Property1"]);
Int32.Parse(values["Property2"]);
Int32.Parse(values["Property3"]);
Int32.Parse(values["Property4"]);
Int32.Parse(values["Property5"]);

And got the following results:

XML: 15387 ms (64000 iterations/sec)
JSON: 11035 ms (90000 iterations/sec)

Quite a difference there.


Coordinator
Jun 8, 2011 at 4:40 AM
Edited Jun 8, 2011 at 4:42 AM

I've updated the LINQ to JSON objects recently to dramatically improve performance for large documents. Try rerunning your performance tests with a Json.NET build from the latest source code.

http://json.codeplex.com/SourceControl/list/changesets