
Manage Ceph CRUSH Map

6 min read
Sang, Nguyen Nhat
Infrastructure Engineer at VNG

Have you ever thought about how to read your Ceph cluster's CRUSH map?

If so, then this blog post is for you.

Intro

Ceph, a scalable distributed storage system, uses the CRUSH (Controlled Replication Under Scalable Hashing) algorithm to manage data placement across OSDs. It defines how data is distributed and replicated within the cluster.
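
Every pool in the cluster is tied to one of the rules defined in this map. As a quick, read-only starting point (the pool name below is a placeholder for one of your pools), you can list the rules and check which one a pool uses:

$ ceph osd crush rule ls

$ ceph osd pool get <pool-name> crush_rule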

Get the CRUSH map

We can get the binary CRUSH map and convert it to a human-readable version using these commands:

$ ceph osd getcrushmap -o crushmap_binary

$ file crushmap_binary
crushmap_binary: GLS_BINARY_LSB_FIRST

$ crushtool -d crushmap_binary -o crushmap_human
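
If you only want to inspect the map and do not plan to edit it, the same information can also be dumped directly as JSON, without the decompile step:

$ ceph osd crush dump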

This is a sample output from the cluster; I will add in-line explanations (lines starting with `# explain:`) to it:

crushmap_human
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# explain: These tunables control the behavior of the CRUSH algorithm,
# which ultimately impacts the data placement
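# note: as a side check, the active tunables can also be inspected
# without decompiling the map, via `ceph osd crush show-tunables`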

# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
...
device 16 osd.16 class ssd
# explain: This defines the list of OSDs in the cluster,
# each assigned a unique ID and a device class
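# note: device classes (here `ssd`) can also be listed from the CLI,
# e.g. `ceph osd crush class ls` and `ceph osd crush class ls-osd ssd`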

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# explain: This defines the bucket types used in the CRUSH hierarchy,
# representing the levels of data organization;
# the smallest unit type is osd
# Example: later we can create our replicated rule based on a
# host failure domain or a rack failure domain
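# note: buckets of these types can be created and rearranged from the CLI
# without editing this file, e.g. (with a hypothetical rack named rack1):
#   $ ceph osd crush add-bucket rack1 rack
#   $ ceph osd crush move sangnn-c1 rack=rack1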


# buckets
root default {
    id -1  # do not change unnecessarily
    id -4 class ssd  # do not change unnecessarily
    # weight 0.00000
    alg straw2
    hash 0  # rjenkins1
}
host sangnn-c2 {
    id -5  # do not change unnecessarily
    id -6 class ssd  # do not change unnecessarily
    # weight 7.00000
    alg straw2
    hash 0  # rjenkins1
    item osd.1 weight 1.00000
    ...
    item osd.8 weight 1.00000
}
host sangnn-c3 {
    id -9  # do not change unnecessarily
    id -10 class ssd  # do not change unnecessarily
    # weight 7.00000
    alg straw2
    hash 0  # rjenkins1
    item osd.10 weight 1.00000
    ...
    item osd.16 weight 1.00000
}
host sangnn-c1 {
    id -7  # do not change unnecessarily
    id -8 class ssd  # do not change unnecessarily
    # weight 2.00000
    alg straw2
    hash 0  # rjenkins1
    item osd.7 weight 1.00000
    item osd.0 weight 1.00000
}
root ssd-01 {
    id -2  # do not change unnecessarily
    id -3 class ssd  # do not change unnecessarily
    # weight 16.00000
    alg straw2
    hash 0  # rjenkins1
    item sangnn-c2 weight 7.00000
    item sangnn-c3 weight 7.00000
    item sangnn-c1 weight 2.00000
}
# explain: Buckets define the relationships between types:
# higher-level buckets (such as the root `ssd-01`) contain
# the lower-level ones (type `host`)
# The current default alg and hash are `straw2` and `rjenkins1` respectively
# Example: the root `ssd-01` contains the 3 hosts `sangnn-c2`, `sangnn-c3`, `sangnn-c1`
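# note: the extra `id -N class ssd` lines are the per-class "shadow" buckets;
# they can be displayed with `ceph osd crush tree --show-shadow`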


# rules
rule replicated_rule {
    id 0
    type replicated
    step take default
    step chooseleaf firstn 0 type host
    step emit
}
rule replicated_ssd {
    id 1
    type replicated
    step take ssd-01 class ssd
    step chooseleaf firstn 0 type host
    step emit
}
# explain: Rules define how data is replicated across the cluster.
# `type`: we can specify `replicated` or `erasure`
# `step take`: Selects a bucket as the starting point
# `step chooseleaf`: Chooses leaf devices (OSDs) from distinct buckets of the specified type (e.g., `host`)
# `step emit`: Finalizes the placement.
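# note: a rule such as `replicated_ssd` can also be created without editing the map,
# e.g. `ceph osd crush rule create-replicated replicated_ssd ssd-01 host ssd`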


# end crush map
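
Before relying on a rule, we can also simulate placements offline with crushtool. A minimal sketch against the map we just downloaded (rule id 1 is `replicated_ssd` in the sample above; adjust `--num-rep` to your pool's replica count):

$ crushtool -i crushmap_binary --test --rule 1 --num-rep 3 --show-mappings

$ crushtool -i crushmap_binary --test --rule 1 --num-rep 3 --show-utilization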

One way to show the logical CRUSH hierarchy is via the `ceph osd tree` command:

$ ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-2         16.00000  root ssd-01
-7          2.00000      host sangnn-c1
 0    ssd   1.00000          osd.0           up   1.00000  1.00000
 7    ssd   1.00000          osd.7           up   1.00000  1.00000
-5          7.00000      host sangnn-c2
 1    ssd   1.00000          osd.1           up   1.00000  1.00000
 2    ssd   1.00000          osd.2           up   1.00000  1.00000
 3    ssd   1.00000          osd.3           up   1.00000  1.00000
 4    ssd   1.00000          osd.4           up   1.00000  1.00000
 5    ssd   1.00000          osd.5           up   1.00000  1.00000
 6    ssd   1.00000          osd.6           up   1.00000  1.00000
 8    ssd   1.00000          osd.8           up   1.00000  1.00000
-9          7.00000      host sangnn-c3
 9    ssd   1.00000          osd.9           up   1.00000  1.00000
10    ssd   1.00000          osd.10          up   1.00000  1.00000
11    ssd   1.00000          osd.11          up   1.00000  1.00000
12    ssd   1.00000          osd.12          up   1.00000  1.00000
13    ssd   1.00000          osd.13          up   1.00000  1.00000
15    ssd   1.00000          osd.15          up   1.00000  1.00000
16    ssd   1.00000          osd.16          up   1.00000  1.00000
-1          0         root default
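
A related view that also shows per-OSD utilization is `ceph osd df tree`:

$ ceph osd df tree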

Inject the map

We can modify the `crushmap_human` file, save it, then recompile and inject it back into the cluster via:

$ crushtool -c crushmap_human -o newcrushmap

$ ceph osd setcrushmap -i newcrushmap
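
Before injecting, it is worth sanity-checking the recompiled map the same way as above (the rule id and replica count here are assumptions; adjust them to your pools):

$ crushtool -i newcrushmap --test --rule 0 --num-rep 3 --show-bad-mappings

If the new map changes any placements, Ceph will start rebalancing data to match it after the `setcrushmap`; progress can be watched with `ceph -s`.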