In the era of big data and cloud computing, many companies choose to build data warehouses on public cloud platforms (such as Amazon AWS and Alibaba Cloud) and use OLAP technology to perform efficient data analysis.
Conducting data analysis at lower cost and higher efficiency, and thus accelerating data-driven decision-making, is of great significance to the competitiveness of enterprises.
At the same time, a wide variety of OLAP technologies have emerged in the industry (such as Spark-SQL, Presto, and Apache Kylin). When selecting a technology, companies need to evaluate the performance and cost of the different options.
This project is a benchmark framework for evaluating the performance and cost of OLAP technologies on the cloud.
Install the Python requirements with the following command:

```shell
pip install -r requirements.txt
```

You need an AWS account to run the benchmark on Amazon AWS. After creating the account, you need to get an AWS Access Key ID and an AWS Secret Access Key. Enter the IAM service on the AWS console and choose the Users option in the Access Management panel. Then, click the Create access key button to get an AWS Access Key ID. You will also get the Secret Access Key there. Keep it in a safe place.
Configure your AWS client with the following command:

```shell
aws configure
```

You need to input your AWS Access Key ID and AWS Secret Access Key here. You can also choose a region in which to set up your cluster on AWS.
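For reference, `aws configure` prompts for four values. The keys below are the placeholder examples from the AWS documentation, not real credentials:

```shell
$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFgEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json
```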
Download this benchmark with git:

```shell
git clone https://github.com/PasaLab/Raven.git
cd Raven
```

You may need to enter your GitHub username and password to download it.
AWS uses an EC2 key pair to secure and manage access to clusters. Enter the EC2 dashboard and choose the Key Pairs option in the Network & Security panel. Then, click the Create key pair button. Enter your key pair name and use the .pem format. You will get a .pem file to download. Download the file into the project under the ./cloud directory. You have to keep this file private in order to use it. For Linux-based systems, run the following command:

```shell
chmod 600 ./cloud/*.pem
```

If you are new to Amazon AWS, you should first verify that your account can create a cluster. Enter the EMR dashboard and click the Create cluster button. Use your EC2 key pair in the Security and access option. Then, click Create cluster and wait for the cluster to launch. During this process, you need to note the following security settings:
- Your subnet ID
- Your security groups for Master
- Your security groups for Core
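For reference, these identifiers follow the standard AWS formats; the values below are placeholders:

```text
Subnet ID:                  subnet-0a1b2c3d4e5f67890
Security group for Master:  sg-0123456789abcdef0
Security group for Core:    sg-0fedcba9876543210
```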
After recording these settings, you can shut down the test cluster.
You need to configure the instances and the applications to install for your cluster; please refer to this instruction. Then, create a cluster for the benchmark:

```shell
python3 ./prepare.py
```

The application will monitor the progress of cluster creation. After the application finishes, you can see the info of the created nodes of the cluster in ./cloud/instances. Here, you can see the public and private DNS addresses of all nodes in the cluster. The first line refers to the master node, and the slave nodes are on the following lines.
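As a purely hypothetical illustration (the actual layout of ./cloud/instances may differ), the file could look like this, with the master node on the first line:

```text
ec2-12-34-56-78.us-east-1.compute.amazonaws.com  ip-172-31-0-10.ec2.internal
ec2-12-34-56-79.us-east-1.compute.amazonaws.com  ip-172-31-0-11.ec2.internal
ec2-12-34-56-80.us-east-1.compute.amazonaws.com  ip-172-31-0-12.ec2.internal
```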
Connect to the nodes of the cluster with the .pem key file and the public DNS address, for example:

```shell
ssh -i "./cloud/YOURKEYNAME.pem" hadoop@ec2-a-b-c-d.YOURREGION.compute.amazonaws.com
```

This command is also available in ./cloud/instances.
Now you can create the environment needed for benchmark testing. Different engines have different environment set-up procedures. Please follow the instructions below:
| Engine Name | Instruction |
|---|---|
| Spark-SQL | Get started with Spark-SQL |
| Presto | Get started with Presto |
| Apache Kylin | Get started with Apache Kylin |
You need to go through different procedures to set up your workload. Please follow the instructions below:
| Workload Name | Instruction |
|---|---|
| TPC-H | Implementing TPC-H workload |
Test-plan-related configurations are located in the ./config/testplans directory. New users do not need to edit the test plan files.
Advanced users can add and remove stages, and change the name, description, concurrency, commands (for offline stages), and queries (for online stages) in the .yaml files, as in the sketch below.
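As a purely illustrative sketch, a test plan stage might look like the following. The field names here are assumptions based on the description above, not the actual Raven schema; consult the files shipped in ./config/testplans for the real layout:

```yaml
# Hypothetical test plan; field names are assumptions, not Raven's actual schema.
name: one-pass
description: Run the offline build once, then all online queries.
stages:
  - name: build
    type: offline
    concurrency: 1
    commands:
      - bash ./workloads/tpc-h/build.sh
  - name: query
    type: online
    concurrency: 4
    queries:
      - q1.sql
      - q2.sql
```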
This benchmark uses CWAgent to monitor metrics during the execution of queries. To make CWAgent available on AWS clusters, you need to install it on all instances of the cluster. Run the following commands on all instances of the cluster:

```shell
sudo yum -y install collectd
wget https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
sudo rpm -U ./amazon-cloudwatch-agent.rpm
```

Metric-related configurations are in the ./config/metrics directory. New users can use one of the files directly.
Advanced users can change which metrics are calculated, as well as the weighted formula that combines the calculated values into the total cost score. All formulae must follow the grammar of Python's eval function, as in the sketch below.
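For illustration, this is how such a formula could be evaluated. The metric names and weights below are made-up examples, not the ones shipped with the benchmark:

```python
# Hypothetical example: evaluate a weighted cost-score formula with eval().
# The metric names and weights are illustrative, not Raven's actual ones.
metrics = {"online_time": 120.5, "offline_time": 340.0, "cpu_cost": 2.7}

# The formula must be a valid Python expression over the metric names.
formula = "0.5 * online_time + 0.3 * offline_time + 0.2 * cpu_cost"

score = eval(formula, {"__builtins__": {}}, metrics)
print(score)  # ~162.79
```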
This benchmark uses the CloudWatch service to get metric data. The configuration of CWAgent on the AWS machines can be edited in ./cloud/cwaconfig.json. You can refer to the CWAgent documentation on AWS to configure this file.
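As a minimal sketch of the CWAgent configuration format (the cwaconfig.json shipped with the project will differ), a file that collects CPU and memory metrics every 60 seconds looks like:

```json
{
  "metrics": {
    "metrics_collected": {
      "cpu": {
        "measurement": ["usage_active"],
        "metrics_collection_interval": 60
      },
      "mem": {
        "measurement": ["used_percent"],
        "metrics_collection_interval": 60
      }
    }
  }
}
```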
With the environment built, you need to perform some configuration to run the tests. All configurations are in the ./config directory on the master node.
./config/config.yaml defines the templates to be used for the benchmark. For example:

```yaml
engine: spark-sql/kylin/presto
workload: tpc-h
test_plan: one-pass/one-pass-concurrent/one-offline-multi-online
metrics: all/time
```

Switch to your local machine and use the scp command to fetch the configured project:
```shell
# in your benchmark directory
scp -i ./cloud/YOURPEM.pem -r hadoop@ec2-a-b-c-d.YOURREGION.compute.amazonaws.com:~/Raven/config ./
scp -i ./cloud/YOURPEM.pem -r hadoop@ec2-a-b-c-d.YOURREGION.compute.amazonaws.com:~/Raven/cloud/cwaconfig.json ./cloud/
```

The public DNS address of the master node of the cluster is needed here.
Then, send these files to the slave nodes of the cluster:

```shell
# in your benchmark directory
cd ..
scp -i ./Raven/cloud/YOURPEM.pem -r ./Raven hadoop@ec2-a-b-c-d.YOURREGION.compute.amazonaws.com:~/
```

The public DNS addresses of the slave nodes of the cluster are needed here.
Use the following command to generate the workloads:

```shell
# on master node
cd ~/Raven
python3 main.py generate
```

Launch CWAgent on all machines of the cluster:
```shell
# on all machines of the cluster
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -s -c file:/home/hadoop/Raven/cloud/cwaconfig.json
```

Then, launch the benchmark on the master node:
```shell
# on master node
cd ~/Raven
python3 main.py run
```

Note the times at which the benchmark starts and finishes; you will need them later.
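For example, one way to capture these two timestamps is to wrap the run with date. Whether monitor.py accepts Unix epoch seconds is an assumption here; adjust the format as needed:

```shell
# Record start/finish times around the run (Unix epoch seconds; the exact
# format expected by monitor.py is an assumption).
START_TIME=$(date +%s)
python3 main.py run
FINISH_TIME=$(date +%s)
echo "$START_TIME $FINISH_TIME"
```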
After execution, stop CWAgent on all machines:

```shell
# on all machines of the cluster
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a stop
```

The metrics consist of three parts. online_time and offline_time are the timestamps of all commands and queries, which are stored by the benchmark after execution. The other metrics are recorded by CWAgent and need to be collected by calling the CloudWatch service.
Use the scp command to download these timestamps to your machine:

```shell
# in your benchmark directory
mkdir metrics
scp -i ./cloud/YOURPEM.pem -r hadoop@ec2-a-b-c-d.YOURREGION.compute.amazonaws.com:~/Raven/offline_times ./metrics/
scp -i ./cloud/YOURPEM.pem -r hadoop@ec2-a-b-c-d.YOURREGION.compute.amazonaws.com:~/Raven/online_times ./metrics/
```

The other metrics are collected when running the calculation script and are also saved in ./metrics/metric.
You recorded the start and finish times of the benchmark earlier. With these two timestamps, run the following command to get the cost score:

```shell
python3 ./monitor.py j-YOURCLUSTERID
```

Your cluster ID is available in ./log/benchmark.log on your local machine. If you have already downloaded the metric file of CWAgent into ./metrics/metric, you can run

```shell
python3 ./monitor.py -1 START_TIME FINISH_TIME
```

instead, to avoid the time limit on collecting metrics from CWAgent.
The benchmark will print the final score on the screen. Now that the benchmark has finished, you can stop the cluster and release all resources:

```shell
aws emr terminate-clusters --cluster-ids j-YOURCLUSTERID
```

Your cluster ID is the same one used in the command above.
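To confirm that the cluster is actually shutting down, you can query its state with the standard AWS CLI:

```shell
# Check that the cluster has entered the TERMINATING/TERMINATED state.
aws emr describe-cluster --cluster-id j-YOURCLUSTERID \
    --query 'Cluster.Status.State' --output text
```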
Raven is under the Apache 2.0 license. See the LICENSE file for details.
