-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Add indexes for vm_stats #8737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add indexes for vm_stats #8737
Conversation
5ab6621 to
5faf462
Compare
|
@blueorangutan package |
|
@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## 4.19 #8737 +/- ##
============================================
- Coverage 30.91% 30.89% -0.03%
+ Complexity 34244 34242 -2
============================================
Files 5354 5355 +1
Lines 376071 376643 +572
Branches 54693 54807 +114
============================================
+ Hits 116255 116346 +91
- Misses 244520 244987 +467
- Partials 15296 15310 +14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8846 |
|
@blueorangutan test |
|
@vishesh92 a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
Hey, @vishesh92 Adding the I've actually ran into a situation where a user was retaining the last 30 days of VM statistics, but decided that they wanted to start retaining only 7 days. When changing the retain time from 30 days to 7 days, ACS had to delete 23 days worth of data. Because of the number of VMs and MGMT servers on his enviroment, they had about 1 billion entries to delete. ACS had trouble to perform this operation as the delete was being done in a single query that always timed out; thus the table started to grow endlessly. I've tried to add this exact index you are proposing; but, when I tested deleting 10 million rows of I've found another solution to the |
engine/schema/src/main/java/com/cloud/upgrade/dao/Upgrade41900to41910.java
Outdated
Show resolved
Hide resolved
|
[SF] Trillian test result (tid-9382)
|
@JoaoJandre I understand what you mean. But for day to day operations, the index on |
@vishesh92 could you share a description of the tests that you've done and their results? |
Sure. I created a table with 661683 entries for timestamp ranging from EXPLAIN ANALYZE SELECT * FROM vm_stats WHERE timestamp < '2000-01-01' ;Query plan with index +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| -> Filter: (vm_stats.`timestamp` < TIMESTAMP'2000-01-01 00:00:00') (cost=66558 rows=330170) (actual time=0.0104..47.4 rows=282075 loops=1) |
| -> Covering index range scan on vm_stats using temp_idx over (timestamp < '2000-01-01 00:00:00') (cost=66558 rows=330170) (actual time=0.00948..35.4 rows=282075 loops=1) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Query plan without index +---------------------------------------------------------------------------------------------------------------------------------------------+
| EXPLAIN |
|---------------------------------------------------------------------------------------------------------------------------------------------|
| -> Filter: (vm_stats.`timestamp` < TIMESTAMP'2000-01-01 00:00:00') (cost=66500 rows=220091) (actual time=0.0198..87.5 rows=282075 loops=1) |
| -> Table scan on vm_stats (cost=66500 rows=660340) (actual time=0.0186..64 rows=661683 loops=1) |
+---------------------------------------------------------------------------------------------------------------------------------------------+As you can see the index reduced the query time by ~40%. |
@vishesh92, from the description of your PR: You're proposing the addition of an index on the |
5faf462 to
ead5be1
Compare
ead5be1 to
74eaea3
Compare
@JoaoJandre We can't run explain analyze with delete commands. Adding the index also helps reduce the load on database while cleaning up the entries. Otherwise the DB will have to go through each record to filter out all the entries which can cause spike in I/O as well. I understand that if the administrator reduces the retention period, it can cause issues but having the index will speed up deletion which runs every minute. |
@vishesh92 you are proposing to add an index to improve DELETE operations; however, you are presenting results for SELECT operations. Indeed, indexes can improve SELECT operations, which only list the data. However, the DELETE operation works different: it removes data, recalculates statistics, reorganizes indexes and more. Please, present some tests and results for DELETE operations; otherwise, it is not feasible to sustain the reason for adding those indexes. |
@JoaoJandre Here are the results for a delete operation where the filter doesn't matches any rows. MySQL root@(none):test> DELETE FROM vm_stats WHERE timestamp < '1950-01-01';
Query OK, 0 rows affected
Time: 0.001s
MySQL root@(none):test> DELETE FROM vm_stats WHERE timestamp < '1950-01-01';
Query OK, 0 rows affected
Time: 0.001squery plan +------------------------------------------------------------------------------------------------+
| EXPLAIN |
|------------------------------------------------------------------------------------------------|
| { |
| "query_block": { |
| "select_id": 1, |
| "table": { |
| "delete": true, |
| "table_name": "vm_stats", |
| "access_type": "range", |
| "possible_keys": [ |
| "temp_idx" |
| ], |
| "key": "temp_idx", |
| "used_key_parts": [ |
| "timestamp" |
| ], |
| "key_length": "5", |
| "ref": [ |
| "const" |
| ], |
| "rows_examined_per_scan": 1, |
| "filtered": "100.00", |
| "attached_condition": "(`test`.`vm_stats`.`timestamp` < TIMESTAMP'1950-01-01 00:00:00')" |
| } |
| } |
| } |
+------------------------------------------------------------------------------------------------+
After dropping the index MySQL root@(none):test> DELETE FROM vm_stats WHERE timestamp < '1950-01-01';
Query OK, 0 rows affected
Time: 0.364s
MySQL root@(none):test> DELETE FROM vm_stats WHERE timestamp < '1950-01-01';
Query OK, 0 rows affected
Time: 0.362squery plan +------------------------------------------------------------------------------------------------+
| EXPLAIN |
|------------------------------------------------------------------------------------------------|
| { |
| "query_block": { |
| "select_id": 1, |
| "table": { |
| "delete": true, |
| "table_name": "vm_stats", |
| "access_type": "ALL", |
| "rows_examined_per_scan": 660340, |
| "filtered": "100.00", |
| "attached_condition": "(`test`.`vm_stats`.`timestamp` < TIMESTAMP'1950-01-01 00:00:00')" |
| } |
| } |
| } |
+------------------------------------------------------------------------------------------------+As you can see, without the index it takes around 0.36 seconds because it has to go through each row in the database. |
@vishesh92 vm_stats table data, timestamp column cardinality in both cases are the same? please check/confirm results with the filter matching rows, and same delete count. |
MySQL root@(none):test> DELETE FROM vm_stats_without_index WHERE timestamp < '1972-05-01'; -- without index
Query OK, 9213 rows affected
Time: 0.414s
MySQL root@(none):test> DELETE FROM vm_stats_with_index WHERE timestamp < '1972-05-01'; -- with index
Query OK, 9213 rows affected
Time: 0.100sMySQL root@(none):test> DELETE FROM vm_stats_without_index WHERE timestamp < '1980-01-01';
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
Query OK, 84840 rows affected
Time: 0.438s
MySQL root@(none):test> DELETE FROM vm_stats_with_index WHERE timestamp < '1980-01-01';
You're about to run a destructive command.
Do you want to proceed? (y/n): y
Your call!
Query OK, 84840 rows affected
Time: 0.375sIn case of a delete operation with some filter, before deleting the records database needs to fetch the records from the database. If the index is not present, this fetching of records from the database becomes slow and leads to scanning of all rows in the table resulting in high I/O on disk. |
thanks for sharing the results @vishesh92 , seems there is some improvement with the index. |
@vishesh92, I'm sorry, but I don't understand the point of a DELETE test where no rows are deleted. I've decided to test/compare the performance of deletion on the vm_stats table using a more realistic scenario. Consider the following: Without adding the proposed indexes, I deleted rows that had a timestamp lower then After, I added the indexes and deleted all the rows with timestamp lower then Great! the indexes made it much faster! roughly 30 times faster! Let's test deleting a whole day's worth of rows then: Without indexes: With indexes: The performance was about 15 times worst this time with indexes. This happens because of what I've explained in this comment: #8737 (comment). When deleting a small amount of the data (0.01% of it), we can find the specific rows we need to delete way faster using the indexes and all the overhead of updating them is worth it. However, when deleting a bigger amount of data (14% of the table) we can see the impact of the indexes' overhead, they make the deletion much slower. The above tests only consider the proposed PR, but I've also tested the performance of the indexes alongside what's proposed on #8740. I've tested making a query to delete a whole day's worth of rows with a limit of 100000 : Without indexes: With indexes: Deleting without indexes was 10 times faster then with indexes. Looking at the tests that I've done, I can agree with you that, for the most common case (deleting 1 minute worth of data), the index addition is worth it. However, if at any time the user decides that they want to retain less stats, the query to delete the excess stats might take much longer, and eventually timeout, leading to a snowball where ACS is unable to clean the Another point is that, while the feature on #8740 is optional, and by default turned off, the indexes proposed here are not optional (at least not without manual DB intervention). Taking this into consideration, I really think that adding the |
|
Thanks for the detailed breakdown on your deletion performance @JoaoJandre. Since you have so many rows in the vm_stats table, what's the performance like when you run a list VM command (eg.
Where is the timeout happening and could the timeout be increased? A slow deletion happening in the background isn't as impactful as a slow or unresponsive user interface, so I'd still rather have the index in place. |
rohityadavcloud
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - the change is largely from a large prod. env where table scans/queries were observed like:
SELECT vm_stats.id, vm_stats.vm_id, vm_stats.mgmt_server_id, vm_stats.timestamp, vm_stats.vm_stats_data FROM vm_stats WHERE vm_stats.vm_id =
It seemed that vm_stats table needed an index or two. I'm not a MySQL expert but the change in prod environment took the query from 15s to < 1s.
|
@JoaoJandre, could you make some tests listing and removing data with the proposed index? |
|
fwiw @borisstoyanov is helping to test this. |
|
[SF] Trillian test result (tid-9478)
|
|
@blueorangutan test matrix |
|
@borisstoyanov a [SL] Trillian-Jenkins matrix job (centos7 mgmt + xenserver71, rocky8 mgmt + vmware67u3, centos7 mgmt + kvmcentos7) has been kicked to run smoke tests |
@GutoVeronezi, @vishesh92, @rohityadavcloud, @sureshanaparti, @mlsorensen, @kohrar I've made some tests listing and deleting data from the vm_stats table using the proposed index, I've also taken the liberty to reproduce the tests using a index on ( First, I tested two listings: Select all the data on a VM; select the data on a VM between two specific dates. Without any indexes: Using the ( Using the ( Again, listing with indexes is much faster, we can see that the select with indexes are an order of magnitude faster then the select without an index. Furthermore, the ( Then, I did three delete tests: deleting one minute of data, deleting a whole day of data, and deleting a whole day of data with limit 100000. Without any indexes: Using the ( Using the ( From the delete tests, we can once again see that, when deleting small amounts of data, the index overhead is unnoticeable. However, when deleting larger parts of the table, the indexes make a large difference, with the deletion using ( Looking at the test results, we can make the following conclusions:
Therefore, based on these results, I believe that we could change the proposed ( |
| } | ||
|
|
||
| private void addIndexes(Connection conn) { | ||
| DbUpgradeUtils.addIndexIfNeeded(conn, "vm_stats", "vm_id", "timestamp"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| DbUpgradeUtils.addIndexIfNeeded(conn, "vm_stats", "vm_id", "timestamp"); | |
| DbUpgradeUtils.addIndexIfNeeded(conn, "vm_stats", "vm_id"); |
|
@JoaoJandre Can you share query times for this query? SELECT vm_stats.id, vm_stats.vm_id, vm_stats.mgmt_server_id, vm_stats.timestamp, vm_stats.vm_stats_data FROM vm_stats WHERE vm_stats.vm_id = 368 ORDER BY vm_stats.timestamp DESC; |
|
[SF] Trillian test result (tid-9496)
|
|
[SF] Trillian test result (tid-9497)
|
|
[SF] Trillian test result (tid-9498)
|
|
Sure @vishesh92 , here are the results: No index: Index on ( Index on ( The results/conclusions are the same: No index is very slow; the ( One point to consider is that this query is returning one whole week's worth of data for a VM and ordering it. That amounts to 40866 rows in this example (that have to be selected from 40 million rows in the table); It's expected that it should take some time. |
|
@blueorangutan package |
|
@vishesh92 a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el7 ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 8969 |
|
@blueorangutan test |
|
@rohityadavcloud a [SL] Trillian-Jenkins test job (centos7 mgmt + kvm-centos7) has been kicked to run smoke tests |
|
[SF] Trillian test result (tid-9524)
|
JoaoJandre
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, only one question
engine/schema/src/main/java/com/cloud/upgrade/dao/Upgrade41900to41910.java
Show resolved
Hide resolved
borisstoyanov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Merging based on reviews, smoketests and Bobby's manual testing. |
* Add indexes for vm_stats * Remove index on timestamp * Chnage index from vm_id,timestamp to vm_id
Description
Fixes #8775
This PR adds index on
vm_idcolumn ofvm_statstable.Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?