Skip to content

Conversation

@cheqianh
Copy link
Owner

@cheqianh cheqianh commented Jun 21, 2022

Description:

Some changes to make the lazily_parse method work.

Highlights:

  1. We only talk about binary performance for now, even though I added text support
  2. I uploaded a test script that I used for testing.

Three main changes in this PR:

  1. A new class IonPyLazyObj to stand for an Ion value that holds a binary representation of the value.
  2. When read (load) Ion data, cache the Ion binary bytes and return an IonPyLazyObj holding these bytes.
  3. When write(dump) Ion data, append the binary to writer's buffer directly if the cached binary exists.

A little more explanation for the test file I uploaded:

  1. Given a binary format of [1,2][3], b'\xe0\x01\x00\xea\xb4\x21\x01\x21\x02\xb2\x21\x03'
  2. We load it to two lazy objects without deserialize it into two IonLists,
[
  IonPyLazyObj_first {
      type t = IonType.LIST
      buffer b = b'\xb4\x21\x01\x21\x02'
  }, 
  IonPyLazyObj_second {
      type t = IonType.LIST
      buffer b = b'\xb2\x21\x03'
  }
]
  1. When we dump above two objects back to ion binary, append the cached binary directly. Currently I wrote IonBlob bytes instead of the regular cached bytes, which should be changed.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Comment on lines +815 to +817
// TODO should append the c_buf to _value_stream directly. Not an Ion BLOB!
// E.g. Something like ION_PUT(writer->_typed_writer.binary._value_stream, c_buf);
IONCHECK(ion_writer_write_blob(writer, (BYTE*)c_buf, len));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good callout. We need to decide the best way to expose this functionality in ion-c. It's probably not an API we'd want to add to the public ion_writer.h header.

I wonder if we can just flush the writer before writing raw bytes, and then write those bytes directly to the underlying ION_STREAM...? If we can do that it may help us avoid having to add to the interface.

Copy link
Owner Author

@cheqianh cheqianh Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, it works. But I need to investigate why ion_writer_flush() didn't work in my sample code so the output is not in a right order...

Current logic parses a list Lazy object to an Ion list including child lazy objects, we don't know the total length when we append the cached bytes in the output ion_stream so it requires a little more calculation.

To know how to calculate the length, the question then turns to how we handle a lazy list. My idea is:

  1. Simply create a lazy object, cache the list's binary representation and set the type to ION_TYPE.LIST like other Ion types.
  2. When users only need to append/delete objects, they can partially_deserialize the Lazy object into a list holding all child lazy objects in order. They can add new Ion values or delete existing values.
  3. When users need to modify a list and its child Ion values, they can fully_deserialize the lazy object into a regular Ion list class holding all deserialized Ion values. They can add, delete or modify values.

The second point above partially deserialize the list so we can reuse the cached binary of its child lazy objects. In other words, user just need to deserialize desired Ion values instead of a full list. Later on if users want to modify a specific child lazy object within the List/Sexp, they can deserialize that lazy object, modify it, and replaced the original lazy object with the modified Ion value. This sounds doable but still need a little more experiment, if partial deserialization and the strategy of modifying lazy objects is faster (and correct too), there is no need to do a fully deserialization. In addition we should consider Ion struct situation since it is unordered and allows duplicated keys.

Copy link
Owner Author

@cheqianh cheqianh Jun 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I need to investigate why ion_writer_flush() didn't work in my sample code so the output is not in a right order

Figured out, I forgot to call it in the top-level.

printf("p_offset is: %" PRId64 "\n", p_offset);
printf("p_length is: %" PRId32 "\n\n", p_length);

// TODO is it possible to return a memory view pointing to the specific position of the original buffer?
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Comment on lines +1547 to +1548
ion_reader_get_value_offset(hreader, &p_offset);
ion_reader_get_value_length(hreader, &p_length);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, I forgot ion-c provided this functionality. It might not be well-tested, however... That is worth looking into. We want to make sure it's accurate in all cases.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, will leave this comment unsolved first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants