Posted by Chris Dent
Some notes on exploring how an OpenStack placement
service behaves at scale.
The initial challenge is setting up a useful environment. To exercise
placement well we need either or both of lots of instances and lots of
resource providers (in the form
of compute nodes where those instances
can land). In the absence of unlimited hardware this needs to be faked
in some fashion.
Thankfully, devstack
provides ways to make use of the fake virt driver to boot fake
instances that don't consume much in the way of resources (but follow
the true API during the spawn process), and to create multiple
nova-compute processes on the same host to manage those fake
instances.
The process of figuring out how to make this go was a combination of
grep, talking to people, and trying and failing multiple times. This
summary is much tidier than the "omg, I have no idea what I'm doing"
process of fail and fail again that led to it.
Also note that I'm not doing formal benchmarking here. Rather I'm
doing human observation of where things go wrong, what variables are
involved and how things feel. This is an important precursor to real
benchmarking to have a clue how the system works. The set up I'm using
would not be ideal for benchmarking, for example, as the VMs I'm using
are on the same physical host (in this case a dual Xeon E5-2620, 32
GB server running ESXi) meaning they impact each other (especially
given the way I've configured the VMs), and aren't subject to physical
networking.
Another thing to note is that while a lot of this experimentation could
be automated, not doing so gives me deeper insight into how things
work, exposes bugs that need to be fixed, and has all the usual
benefits gained from doing things "the hard way". For formal testing
(where repeating things is paramount) all this faffing about by humans
would not be good. But for this, it is.
I eventually landed on the following set up with two VMs, one as the
control plane (ds1), one as the compute host (cn1).
ds1 is a 16 core, 16GB VM. It's hosting control plane services and
mysql and rabbitmq. This is where the scheduler and placement run.
cn1 is a 10 core, 11GB VM and is running 75 nova-compute processes,
the metadata server, and neutron agent.
To limit message bus traffic, notifications are configured to only
send unversioned rather than the default of both. There's
currently no easy way to disable notifications entirely.
The "Noop" quota driver is used because we don't want to care about
quotas in this case.
The filter scheduler is used, but all filters are turned off.
These last two tricks were learned from some devstack experiments by
Matt Riedemann.
Both VMs are Ubuntu Artful, both are using master for all the OpenStack
services, except for devstack itself, which needs this
fix (to a bug caused by me).
The devstack configurations are relatively straightforward, the
important pieces are:
Setting the virt driver: VIRT_DRIVER=fake
Telling devstack how many fake compute nodes we want:
NUMBER_FAKE_NOVA_COMPUTE=75. This will create multiple compute
nodes each of which uses a common config file, plus a config file
unique to the process that sets the host name of the nova-compute
process (required to get unique resource providers).
Manipulating the nova.conf with a [[post-config|$NOVA_CONF]]
section to set a few things.
The local.conf for cn1 (the compute host) is:
[[local|localrc]]
HOST_IP=192.168.1.149
SERVICE_HOST=192.168.1.76
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
MULTI_HOST=1
MYSQL_HOST=$SERVICE_HOST
RABBIT_HOST=$SERVICE_HOST
GLANCE_HOSTPORT=$SERVICE_HOST:9292
RECLONE=yes
ENABLED_SERVICES=n-cpu,q-agt,n-api-meta,placement-client
VIRT_DRIVER=fake
NUMBER_FAKE_NOVA_COMPUTE=75
[[post-config|$NOVA_CONF]]
[quota]
driver = "nova.quota.NoopQuotaDriver"
[filter_scheduler]
enabled_filters = '""'
[notifications]
notification_format = unversioned
I'm using static IPs because it makes things easier. If you are trying
to repeat this in your own environment your HOST_IP and
SERVICE_HOST will likely be different. Everything else ought to be
the same. Explicitly setting ENABLED_SERVICES ensures that only the
stuff you really need is running. See Multi-Node
Lab
for some more information on multi-node devstack (Note that there is a
lot in there you don't need to care about if you aren't actually going
to use the VMs that you create in the deployment).
The local.conf for the control plane (ds1) mostly uses defaults but
disables some services that we don't care about, and adjusts the nova
config as required:
[[local|localrc]]
ADMIN_PASSWORD=secret
DATABASE_PASSWORD=$ADMIN_PASSWORD
RABBIT_PASSWORD=$ADMIN_PASSWORD
SERVICE_PASSWORD=$ADMIN_PASSWORD
MULTI_HOST=1
VIRT_DRIVER=fake
RECLONE=True
disable_service horizon
disable_service dstat
disable_service tempest
disable_service n-cpu
disable_service q-agt
disable_service n-api-meta
[[post-config|$NOVA_CONF]]
[quota]
driver = "nova.quota.NoopQuotaDriver"
[filter_scheduler]
enabled_filters = '""'
[notifications]
notification_format = unversioned
Note that we are disabling the services that will be running on the
compute host.
There are redundancies between these two files. Some of the stuff
required by one is in the other. This is because I started out with
nova-compute on both hosts and haven't fully rationalized the
local.conf files.
Now that we know what we're building we can build it. The control
plane (ds1) needs to be in place before the compute host, so build devstack there first:
cd wherever_devstack_is
./stack.sh
and wait. When it completes do the same on the compute host (cn1).
When that is done, the control host needs to be made aware of the
compute host, after which you can verify the presence of the 75
hypervisors:
. openrc admin admin
nova-manage cell_v2 discover_hosts
openstack hypervisor list
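If everything registered, that list should have 75 entries. A quick way to
count them (nothing here beyond plain openstackclient usage):
# Count registered hypervisors; 75 means every fake nova-compute checked in.
openstack hypervisor list -f value -c "Hypervisor Hostname" | wc -l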
Playing With It
Once all that is done it is possible to send a few different workload
patterns at the service. It's hard to do this in a way that isolates
any particular service as they all interact so much.
In my first round of experiments, yesterday, I tried a few different
scenarios to get a sense of how things worked and what variables exist.
When booting a large number of servers from a small number of nova
boot commands with a high min-count (e.g., 1000) the placement api
processes are lost as noise in the face of the much greater effort being
made by nova-conductor.
It is only when a larger number of smaller requests (15 concurrent
requests for 50 instances each) are made that the placement API begins to show
any signs that it is working hard. This is about what you would
expect: talking to /allocation_candidates is certainly where most
effort happens and most data is processed.
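If you want a feel for what each of those scheduling requests costs, you can
poke the endpoint directly. This is just a sketch: the /placement prefix on
the service host and the 1.17 microversion are assumptions about this
devstack setup, and the resource amounts approximate an m1.tiny.
# Ask placement for allocation candidates roughly matching an m1.tiny.
TOKEN=$(openstack token issue -f value -c id)
curl -s \
  -H "X-Auth-Token: $TOKEN" \
  -H "OpenStack-API-Version: placement 1.17" \
  "http://192.168.1.76/placement/allocation_candidates?resources=VCPU:1,MEMORY_MB:512,DISK_GB:1" \
  | python -m json.tool | head -n 30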
Today I decided to narrow things down to making lots of parallel boots
of single instances, to impact the placement service as much as possible.
If you intend to start many nova boot (or openstack server)
commands at the same time, make sure you do them from a third machine.
I tried to do 300 nova boot commands, which pushed my load average
over 400 and brought the world to a complete stop.
In the current devstack (February 2018) we can use built-in flavor and image
references when making a boot request. In addition, since we are making fakes
we can set the nic to none. This boots one server named foobar using the
m1.tiny flavor:
nova boot --flavor 1 --nic none --image cirros-0.3.5-x86_64-disk foobar
We can boot 1000 of those with:
nova boot --flavor 1 --nic none --image cirros-0.3.5-x86_64-disk --min-count 1000 foobar
Each instance will get a numeric suffix. As stated above this doesn't
stress placement much.
If we do want to stress placement we need to increase the number of
concurrent requests to GET /allocation_candidates, at which point
the number of instances per boot request is less of an issue. One way
to do this is to background a mess of boot commands:
for i in {1..100}; do \
nova boot --flavor 1 --nic none --image cirros-0.3.5-x86_64-disk ${i}-foobar &
done
But more often than not this will cause the calls to the
nova-scheduler process to time out when the conductor tries to call
select_destinations. We can work around this by hacking
nova-scheduler to have more workers. Since this is something that
requires a hack presumably there's a reason for it.
diff --git a/nova/cmd/scheduler.py b/nova/cmd/scheduler.py
index 51d5aee4ac..d794eacaf3 100644
--- a/nova/cmd/scheduler.py
+++ b/nova/cmd/scheduler.py
@@ -45,5 +45,5 @@ def main():
     server = service.Service.create(binary='nova-scheduler',
                                     topic=scheduler_rpcapi.RPC_TOPIC)
-    service.serve(server)
+    service.serve(server, workers=4)
     service.wait()
Running four nova-scheduler workers, the above nova boot command works
fine with no timeout. However, code to do this was never
merged for reasons (which
may or may not still be valid with the existence of placement)
discussed on the review and in a related
email.
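After applying a hack like that, the scheduler has to be restarted to pick
it up; with a systemd-based devstack that's something like this (the unit
name is an assumption, adjust to match your deployment):
# Restart the nova-scheduler devstack unit so the workers change takes effect.
sudo systemctl restart devstack@n-sch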
Then I tried:
for i in {1..500}; do \
nova boot --flavor 1 --nic none --image cirros-0.3.5-x86_64-disk ${i}-foobar & \
done
500 parallel boots. This caused the Apache process (which provides a
front end to keystone, glance, the compute api, and placement) to
freeze up and need MaxRequestWorkers raised. Apache (in a default
configuration) is a pretty weak link in this stuff. It's easy to see why
people prefer nginx in situations where all the web server is really
doing is being a reverse proxy.
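For reference, one way to raise the limit on a stock Ubuntu Apache using the
event MPM looks roughly like the following; the file name and numbers are
assumptions, not a tuned recommendation.
# Raise the event MPM limits (MaxRequestWorkers <= ServerLimit * ThreadsPerChild,
# which defaults to 25) and restart Apache to pick them up.
sudo tee /etc/apache2/conf-available/more-workers.conf <<'EOF' >/dev/null
<IfModule mpm_event_module>
    ServerLimit        32
    MaxRequestWorkers  800
</IfModule>
EOF
sudo a2enconf more-workers
sudo systemctl restart apache2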
Once Apache is sorted, then it is my (non-VM) machine doing the nova
boots that suffers. It seems that 500 nova boot commands doing actual
work, instead of just timing out trying to contact a stuck web server,
are not a happy thing for one machine to run. 15 minutes later it woke
up and the boots started. Shrug.
At which point select_destinations started timing out again. 4
workers not enough? I can (and did) raise it to eight but it doesn't
change the fact that my 500 parallel nova boot commands get stuck if
run from one machine, and at the moment I've run out of free hardware.
So instead I've spread the load a bit:
for j in {1..10} ; do \
for i in {1..50}; do \
nova boot --flavor 1 --nic none --image cirros-0.3.5-x86_64-disk ${i}-foobar & \
done ; \
sleep 60 ; \
done
After this I get 500 ACTIVE instances in fairly short order. The
processes which seem to get the most work are cell1 conductor,
interleaved with the nova-scheduler.
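For what it's worth, one low-tech way to see which services are doing the
work is to watch process CPU on ds1 while the boots run:
# Show the busiest nova services by CPU to see who is doing the work.
ps -eo pcpu,args --sort=-pcpu | grep -E 'nova-(conductor|scheduler)' | head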
At this stage it makes sense to check that the placement database has
expected data:
1500 allocations: correct. (3 for each instance)
75 resource providers: correct.
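The check itself is just counting rows. A sketch, assuming the devstack
database credentials above and that the placement tables still live in the
nova_api database (as they did in early 2018):
# Count allocations and resource providers directly in the database.
mysql -u root -psecret nova_api -e \
  'SELECT COUNT(*) FROM allocations; SELECT COUNT(*) FROM resource_providers;'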
Then it is time to delete all those servers:
openstack server list -f value -c ID | xargs openstack server delete
While that is happening it is again the cell conductor that sweats.
0 allocations in the placement db when that's done. ✔
Random Observations
Some thoughts that didn't quite fit anywhere else:
We know this already, but an idle compute-manager is fairly chatty
with the placement service. If you have 75 of them, that chat starts
to add up: Approximately 246 requests per minute, checking on the
state of inventory and allocations. Work is already in progress to
investigate this, but it should be noted that the placement service
handles this traffic with aplomb. In fact at no point during the entire
exercise did the placement service sweat.
It makes sense that if you're going to have 8 conductors you want at
least 8 schedulers?
This stuff simply won't work without multiple scheduler workers. If
the rpc timeout limit is raised that can make things work but only
very slowly. This suggests that it is important for us to a) make
sure that running multiple scheduler workers is safe, b) change the code
(as in the diff above) so that multiple workers are possible, c) recommend doing it.
The placement UWSGI processes appear fairly stable memory-wise.
It's important to note that no traits, custom resource classes,
nested resource providers, aggregates or shared resource providers
are used here. Having any of those in the loop could impact the
profile. We don't yet know.
The control plane host is working at full tilt through all of this.
The compute host not much at all (because it is fake). This
suggests that distributing the control plane services broadly is
important. I will probably try to integrate these experiments with
my placement container
experiments, putting those
containers on a different host. It looks like having the cell conductor
elsewhere would be interesting to observe as well.
Doing this kind of thing is a huge learning experience and a
valuable use of time (despite taking a lot of time). I wish I could
remember that more often.