Skip to content

Azure Table indexing issues #7

@alexswan10k

Description

@alexswan10k

Hi,

Firstly, good stuff so far. This project is looking pretty interesting.

So, I am seriously considering using this, however, I have a couple of scalability concerns which I feel need addressing first. Let me clarify:

Azure table only indexes partition key and row key, which forms a sort of composite primary key. As I understand it there are no secondary indexes at all, so if I were to query a table field which is not partition or row key, I would effectively be running a 'partition scan' (a mini table scan).

Because of this, it is imperative to store (at least a copy) of the data in a way that is read-optimized if it is going to be used as a primary store. Having a dig through the source code it seems that there are basically two common types of query groups:

  • Get all events in stream

  • Get all events in stream where offset > myLocalOffset (there are others but this is most prevalent IMO)

  • Get a specific stream summary by streamId

  • Get all stream summaries (so I can determine if there are new events to accumulate over)

  • Get a specific stream summary (as above, probably more important in practice)

A couple of things jumped out at me -

  • If I want to query multiple stream summaries, I have to do a cross-partition query which is inefficient
  • If I want to do a range query, Position is not indexed, which means scanning all events every time.

The latter one worries me the most, and I think it can easily be fixed by simply rearranging the ID field with the Position field. As far as I can tell the event Ids are arbitrary and are rarely (if ever) queried against. Perhaps we can exchange the RowKey to actually be the position indexer and the Id as a secondary property? The benefit of this is range queries will be executed against a clustered index, and thus will be massively more efficient.

The summaries issue could be reduced by putting the summaries in a separate partition or similar where the RowKey is the streamId, although this has its own scalability issues if you have a lot of streams. Alternatively, you probably at least want an API that gets just a single stream by partition key and row key (stream id). This would be far more efficient than a cross partition query, and seems like the most common operation to me. I often know what I am projecting from, so to provide that explicitly in a query makes sense (although there is an argument to be made that I could just directly get the events after X, which will be 0 events most of the time).

I would be happy to contribute here or help out btw, just I am aware this is a massive breaking change for existing users so I didnt want to dive right in. Let me know if/how I can help though!

There is a very comprehensive article about all this indexing stuff, along with patterns of how to solve here:
https://docs.microsoft.com/en-gb/azure/cosmos-db/table-storage-design-guide#index-entities-pattern

Metadata

Metadata

Assignees

No one assigned

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions