Manage Ceph CRUSH Map
Have you ever wondered how to read your Ceph cluster's CRUSH map?
If so, this blog post is for you.
Intro
Ceph, a scalable distributed storage system, uses the CRUSH (Controlled Replication Under Scalable Hashing) algorithm to manage data placement across OSDs. The CRUSH map defines how data is distributed and replicated within the cluster.
Get the CRUSH map
We can fetch the binary CRUSH map and decompile it into a human-readable version with these commands:
$ ceph osd getcrushmap -o crushmap_binary
$ file crushmap_binary
crushmap_binary: GLS_BINARY_LSB_FIRST
$ crushtool -d crushmap_binary -o crushmap_human
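If we only need a quick read-only view, the CRUSH map can also be dumped directly as JSON, without the intermediate binary; its output carries the same devices, types, buckets, rules, and tunables sections we will walk through below:
$ ceph osd crush dump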
Below is a sample output from the cluster; I have added inline explanations (each starting with # explain:) throughout:
crushmap_human
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54
# explain: These tunables control the behavior of the CRUSH algorithm,
# which ultimately impacts data placement
# devices
device 0 osd.0 class ssd
device 1 osd.1 class ssd
...
device 16 osd.16 class ssd
# explain: This lists the OSDs in the cluster,
# each assigned a unique ID and a device class
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 zone
type 10 region
type 11 root
# explain: This defines the bucket type hierarchy used in the CRUSH map,
# representing the levels of physical data organization;
# the smallest unit type is osd
# Example: later we can create replicated rules with a
# host failure domain or a rack failure domain
# buckets
root default {
        id -1           # do not change unnecessarily
        id -4 class ssd         # do not change unnecessarily
        # weight 0.00000
        alg straw2
        hash 0  # rjenkins1
}
host sangnn-c2 {
        id -5           # do not change unnecessarily
        id -6 class ssd         # do not change unnecessarily
        # weight 7.00000
        alg straw2
        hash 0  # rjenkins1
        item osd.1 weight 1.00000
        ...
        item osd.8 weight 1.00000
}
host sangnn-c3 {
        id -9           # do not change unnecessarily
        id -10 class ssd                # do not change unnecessarily
        # weight 7.00000
        alg straw2
        hash 0  # rjenkins1
        item osd.10 weight 1.00000
        ...
        item osd.16 weight 1.00000
}
host sangnn-c1 {
        id -7           # do not change unnecessarily
        id -8 class ssd         # do not change unnecessarily
        # weight 2.00000
        alg straw2
        hash 0  # rjenkins1
        item osd.7 weight 1.00000
        item osd.0 weight 1.00000
}
root ssd-01 {
        id -2           # do not change unnecessarily
        id -3 class ssd         # do not change unnecessarily
        # weight 16.00000
        alg straw2
        hash 0  # rjenkins1
        item sangnn-c2 weight 7.00000
        item sangnn-c3 weight 7.00000
        item sangnn-c1 weight 2.00000
}
# explain: Buckets define the relationships between types:
# higher-level buckets (such as the `root ssd-01`) contain
# the lower-level ones (the `host` buckets)
# The current default bucket alg and hash are `straw2` and `rjenkins1` respectively
# Example: the root `ssd-01` contains 3 hosts: `sangnn-c2`, `sangnn-c3`, `sangnn-c1`
# rules
rule replicated_rule {
        id 0
        type replicated
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule replicated_ssd {
        id 1
        type replicated
        step take ssd-01 class ssd
        step chooseleaf firstn 0 type host
        step emit
}
# explain: Rules define how data is replicated across the cluster.
# `type` can be either `replicated` or `erasure`
# `step take`: selects a bucket as the starting point
# `step chooseleaf firstn 0 type host`: picks one leaf device (OSD) under
#   each of N distinct buckets of the failure-domain type (here `host`);
#   `firstn 0` means N is the pool's replica count
# `step emit`: finalizes the placement
# end crush map
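Much of the above can also be inspected at runtime without decompiling anything. For example, the active tunables and the compiled rules are exposed directly through the CLI (these subcommands should exist on any recent Ceph release):
$ ceph osd crush show-tunables
$ ceph osd crush rule ls
replicated_rule
replicated_ssd
$ ceph osd crush rule dump replicated_ssd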
One way to view the logical CRUSH hierarchy is the ceph osd tree command:
$ ceph osd tree
ID   CLASS  WEIGHT    TYPE NAME           STATUS  REWEIGHT  PRI-AFF
 -2         16.00000  root ssd-01
 -7          2.00000      host sangnn-c1
  0    ssd   1.00000          osd.0           up   1.00000  1.00000
  7    ssd   1.00000          osd.7           up   1.00000  1.00000
 -5          7.00000      host sangnn-c2
  1    ssd   1.00000          osd.1           up   1.00000  1.00000
  2    ssd   1.00000          osd.2           up   1.00000  1.00000
  3    ssd   1.00000          osd.3           up   1.00000  1.00000
  4    ssd   1.00000          osd.4           up   1.00000  1.00000
  5    ssd   1.00000          osd.5           up   1.00000  1.00000
  6    ssd   1.00000          osd.6           up   1.00000  1.00000
  8    ssd   1.00000          osd.8           up   1.00000  1.00000
 -9          7.00000      host sangnn-c3
  9    ssd   1.00000          osd.9           up   1.00000  1.00000
 10    ssd   1.00000          osd.10          up   1.00000  1.00000
 11    ssd   1.00000          osd.11          up   1.00000  1.00000
 12    ssd   1.00000          osd.12          up   1.00000  1.00000
 13    ssd   1.00000          osd.13          up   1.00000  1.00000
 15    ssd   1.00000          osd.15          up   1.00000  1.00000
 16    ssd   1.00000          osd.16          up   1.00000  1.00000
 -1                0  root default
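Note the class-specific ids we saw in the buckets section (e.g. id -4 class ssd): for each device class, CRUSH maintains an internal "shadow" hierarchy named like ssd-01~ssd. On recent releases it can be displayed alongside the regular tree with:
$ ceph osd crush tree --show-shadow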
Inject the map
We can modify the crushmap_human file, save it, and inject it back into the cluster:
$ crushtool -c crushmap_human -o newcrushmap
$ ceph osd setcrushmap -i newcrushmap
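A bad edit here can affect data placement cluster-wide, so it is wise to sanity-check the compiled map with crushtool's test mode before injecting it. For example, to simulate rule 1 (replicated_ssd) placing 3 replicas and to report any inputs that fail to map:
$ crushtool -i newcrushmap --test --rule 1 --num-rep 3 --show-mappings
$ crushtool -i newcrushmap --test --rule 1 --num-rep 3 --show-bad-mappings
Finally, a rule only takes effect once a pool uses it. Assuming a pool named mypool (substitute your own pool name), we can point it at the new rule and verify (Luminous and later):
$ ceph osd pool set mypool crush_rule replicated_ssd
$ ceph osd pool get mypool crush_rule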
