blob: 38883276d31c379fe6de344537b08d0b4e10e6a9
1 | The cluster MD is a shared-device RAID for a cluster. |
2 | |
3 | |
4 | 1. On-disk format |
5 | |
6 | Separate write-intent-bitmaps are used for each cluster node. |
7 | The bitmaps record all writes that may have been started on that node, |
8 | and may not yet have finished. The on-disk layout is: |
9 | |
10 | 0 4k 8k 12k |
11 | ------------------------------------------------------------------- |
12 | | idle | md super | bm super [0] + bits | |
13 | | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | |
14 | | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | |
15 | | bm bits [3, contd] | | | |
16 | |
17 | During "normal" functioning we assume the filesystem ensures that only |
18 | one node writes to any given block at a time, so a write request will |
19 | |
20 | - set the appropriate bit (if not already set) |
21 | - commit the write to all mirrors |
22 | - schedule the bit to be cleared after a timeout. |
23 | |
24 | Reads are just handled normally. It is up to the filesystem to ensure |
25 | one node doesn't read from a location where another node (or the same |
26 | node) is writing. |
27 | |
28 | |
29 | 2. DLM Locks for management |
30 | |
31 | There are three groups of locks for managing the device: |
32 | |
33 | 2.1 Bitmap lock resource (bm_lockres) |
34 | |
35 | The bm_lockres protects individual node bitmaps. They are named in |
36 | the form bitmap000 for node 1, bitmap001 for node 2 and so on. When a |
37 | node joins the cluster, it acquires the lock in PW mode and it stays |
38 | so during the lifetime the node is part of the cluster. The lock |
39 | resource number is based on the slot number returned by the DLM |
40 | subsystem. Since DLM starts node count from one and bitmap slots |
41 | start from zero, one is subtracted from the DLM slot number to arrive |
42 | at the bitmap slot number. |
43 | |
44 | The LVB of the bitmap lock for a particular node records the range |
45 | of sectors that are being re-synced by that node. No other |
46 | node may write to those sectors. This is used when a new nodes |
47 | joins the cluster. |
48 | |
49 | 2.2 Message passing locks |
50 | |
51 | Each node has to communicate with other nodes when starting or ending |
52 | resync, and for metadata superblock updates. This communication is |
53 | managed through three locks: "token", "message", and "ack", together |
54 | with the Lock Value Block (LVB) of one of the "message" lock. |
55 | |
56 | 2.3 new-device management |
57 | |
58 | A single lock: "no-new-dev" is used to co-ordinate the addition of |
59 | new devices - this must be synchronized across the array. |
60 | Normally all nodes hold a concurrent-read lock on this device. |
61 | |
62 | 3. Communication |
63 | |
64 | Messages can be broadcast to all nodes, and the sender waits for all |
65 | other nodes to acknowledge the message before proceeding. Only one |
66 | message can be processed at a time. |
67 | |
68 | 3.1 Message Types |
69 | |
70 | There are six types of messages which are passed: |
71 | |
72 | 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has |
73 | been updated, and the node must re-read the md superblock. This is |
74 | performed synchronously. It is primarily used to signal device |
75 | failure. |
76 | |
77 | 3.1.2 RESYNCING: informs other nodes that a resync is initiated or |
78 | ended so that each node may suspend or resume the region. Each |
79 | RESYNCING message identifies a range of the devices that the |
80 | sending node is about to resync. This over-rides any pervious |
81 | notification from that node: only one ranged can be resynced at a |
82 | time per-node. |
83 | |
84 | 3.1.3 NEWDISK: informs other nodes that a device is being added to |
85 | the array. Message contains an identifier for that device. See |
86 | below for further details. |
87 | |
88 | 3.1.4 REMOVE: A failed or spare device is being removed from the |
89 | array. The slot-number of the device is included in the message. |
90 | |
91 | 3.1.5 RE_ADD: A failed device is being re-activated - the assumption |
92 | is that it has been determined to be working again. |
93 | |
94 | 3.1.6 BITMAP_NEEDS_SYNC: if a node is stopped locally but the bitmap |
95 | isn't clean, then another node is informed to take the ownership of |
96 | resync. |
97 | |
98 | 3.2 Communication mechanism |
99 | |
100 | The DLM LVB is used to communicate within nodes of the cluster. There |
101 | are three resources used for the purpose: |
102 | |
103 | 3.2.1 token: The resource which protects the entire communication |
104 | system. The node having the token resource is allowed to |
105 | communicate. |
106 | |
107 | 3.2.2 message: The lock resource which carries the data to |
108 | communicate. |
109 | |
110 | 3.2.3 ack: The resource, acquiring which means the message has been |
111 | acknowledged by all nodes in the cluster. The BAST of the resource |
112 | is used to inform the receiving node that a node wants to |
113 | communicate. |
114 | |
115 | The algorithm is: |
116 | |
117 | 1. receive status - all nodes have concurrent-reader lock on "ack". |
118 | |
119 | sender receiver receiver |
120 | "ack":CR "ack":CR "ack":CR |
121 | |
122 | 2. sender get EX on "token" |
123 | sender get EX on "message" |
124 | sender receiver receiver |
125 | "token":EX "ack":CR "ack":CR |
126 | "message":EX |
127 | "ack":CR |
128 | |
129 | Sender checks that it still needs to send a message. Messages |
130 | received or other events that happened while waiting for the |
131 | "token" may have made this message inappropriate or redundant. |
132 | |
133 | 3. sender writes LVB. |
134 | sender down-convert "message" from EX to CW |
135 | sender try to get EX of "ack" |
136 | [ wait until all receivers have *processed* the "message" ] |
137 | |
138 | [ triggered by bast of "ack" ] |
139 | receiver get CR on "message" |
140 | receiver read LVB |
141 | receiver processes the message |
142 | [ wait finish ] |
143 | receiver releases "ack" |
144 | receiver tries to get PR on "message" |
145 | |
146 | sender receiver receiver |
147 | "token":EX "message":CR "message":CR |
148 | "message":CW |
149 | "ack":EX |
150 | |
151 | 4. triggered by grant of EX on "ack" (indicating all receivers |
152 | have processed message) |
153 | sender down-converts "ack" from EX to CR |
154 | sender releases "message" |
155 | sender releases "token" |
156 | receiver upconvert to PR on "message" |
157 | receiver get CR of "ack" |
158 | receiver release "message" |
159 | |
160 | sender receiver receiver |
161 | "ack":CR "ack":CR "ack":CR |
162 | |
163 | |
164 | 4. Handling Failures |
165 | |
166 | 4.1 Node Failure |
167 | |
168 | When a node fails, the DLM informs the cluster with the slot |
169 | number. The node starts a cluster recovery thread. The cluster |
170 | recovery thread: |
171 | |
172 | - acquires the bitmap<number> lock of the failed node |
173 | - opens the bitmap |
174 | - reads the bitmap of the failed node |
175 | - copies the set bitmap to local node |
176 | - cleans the bitmap of the failed node |
177 | - releases bitmap<number> lock of the failed node |
178 | - initiates resync of the bitmap on the current node |
179 | md_check_recovery is invoked within recover_bitmaps, |
180 | then md_check_recovery -> metadata_update_start/finish, |
181 | it will lock the communication by lock_comm. |
182 | Which means when one node is resyncing it blocks all |
183 | other nodes from writing anywhere on the array. |
184 | |
185 | The resync process is the regular md resync. However, in a clustered |
186 | environment when a resync is performed, it needs to tell other nodes |
187 | of the areas which are suspended. Before a resync starts, the node |
188 | send out RESYNCING with the (lo,hi) range of the area which needs to |
189 | be suspended. Each node maintains a suspend_list, which contains the |
190 | list of ranges which are currently suspended. On receiving RESYNCING, |
191 | the node adds the range to the suspend_list. Similarly, when the node |
192 | performing resync finishes, it sends RESYNCING with an empty range to |
193 | other nodes and other nodes remove the corresponding entry from the |
194 | suspend_list. |
195 | |
196 | A helper function, ->area_resyncing() can be used to check if a |
197 | particular I/O range should be suspended or not. |
198 | |
199 | 4.2 Device Failure |
200 | |
201 | Device failures are handled and communicated with the metadata update |
202 | routine. When a node detects a device failure it does not allow |
203 | any further writes to that device until the failure has been |
204 | acknowledged by all other nodes. |
205 | |
206 | 5. Adding a new Device |
207 | |
208 | For adding a new device, it is necessary that all nodes "see" the new |
209 | device to be added. For this, the following algorithm is used: |
210 | |
211 | 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues |
212 | ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD) |
213 | 2. Node 1 sends a NEWDISK message with uuid and slot number |
214 | 3. Other nodes issue kobject_uevent_env with uuid and slot number |
215 | (Steps 4,5 could be a udev rule) |
216 | 4. In userspace, the node searches for the disk, perhaps |
217 | using blkid -t SUB_UUID="" |
218 | 5. Other nodes issue either of the following depending on whether |
219 | the disk was found: |
220 | ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and |
221 | disc.number set to slot number) |
222 | ioctl(CLUSTERED_DISK_NACK) |
223 | 6. Other nodes drop lock on "no-new-devs" (CR) if device is found |
224 | 7. Node 1 attempts EX lock on "no-new-dev" |
225 | 8. If node 1 gets the lock, it sends METADATA_UPDATED after |
226 | unmarking the disk as SpareLocal |
227 | 9. If not (get "no-new-dev" lock), it fails the operation and sends |
228 | METADATA_UPDATED. |
229 | 10. Other nodes get the information whether a disk is added or not |
230 | by the following METADATA_UPDATED. |
231 | |
232 | 6. Module interface. |
233 | |
234 | There are 17 call-backs which the md core can make to the cluster |
235 | module. Understanding these can give a good overview of the whole |
236 | process. |
237 | |
238 | 6.1 join(nodes) and leave() |
239 | |
240 | These are called when an array is started with a clustered bitmap, |
241 | and when the array is stopped. join() ensures the cluster is |
242 | available and initializes the various resources. |
243 | Only the first 'nodes' nodes in the cluster can use the array. |
244 | |
245 | 6.2 slot_number() |
246 | |
247 | Reports the slot number advised by the cluster infrastructure. |
248 | Range is from 0 to nodes-1. |
249 | |
250 | 6.3 resync_info_update() |
251 | |
252 | This updates the resync range that is stored in the bitmap lock. |
253 | The starting point is updated as the resync progresses. The |
254 | end point is always the end of the array. |
255 | It does *not* send a RESYNCING message. |
256 | |
257 | 6.4 resync_start(), resync_finish() |
258 | |
259 | These are called when resync/recovery/reshape starts or stops. |
260 | They update the resyncing range in the bitmap lock and also |
261 | send a RESYNCING message. resync_start reports the whole |
262 | array as resyncing, resync_finish reports none of it. |
263 | |
264 | resync_finish() also sends a BITMAP_NEEDS_SYNC message which |
265 | allows some other node to take over. |
266 | |
267 | 6.5 metadata_update_start(), metadata_update_finish(), |
268 | metadata_update_cancel(). |
269 | |
270 | metadata_update_start is used to get exclusive access to |
271 | the metadata. If a change is still needed once that access is |
272 | gained, metadata_update_finish() will send a METADATA_UPDATE |
273 | message to all other nodes, otherwise metadata_update_cancel() |
274 | can be used to release the lock. |
275 | |
276 | 6.6 area_resyncing() |
277 | |
278 | This combines two elements of functionality. |
279 | |
280 | Firstly, it will check if any node is currently resyncing |
281 | anything in a given range of sectors. If any resync is found, |
282 | then the caller will avoid writing or read-balancing in that |
283 | range. |
284 | |
285 | Secondly, while node recovery is happening it reports that |
286 | all areas are resyncing for READ requests. This avoids races |
287 | between the cluster-filesystem and the cluster-RAID handling |
288 | a node failure. |
289 | |
290 | 6.7 add_new_disk_start(), add_new_disk_finish(), new_disk_ack() |
291 | |
292 | These are used to manage the new-disk protocol described above. |
293 | When a new device is added, add_new_disk_start() is called before |
294 | it is bound to the array and, if that succeeds, add_new_disk_finish() |
295 | is called the device is fully added. |
296 | |
297 | When a device is added in acknowledgement to a previous |
298 | request, or when the device is declared "unavailable", |
299 | new_disk_ack() is called. |
300 | |
301 | 6.8 remove_disk() |
302 | |
303 | This is called when a spare or failed device is removed from |
304 | the array. It causes a REMOVE message to be send to other nodes. |
305 | |
306 | 6.9 gather_bitmaps() |
307 | |
308 | This sends a RE_ADD message to all other nodes and then |
309 | gathers bitmap information from all bitmaps. This combined |
310 | bitmap is then used to recovery the re-added device. |
311 | |
312 | 6.10 lock_all_bitmaps() and unlock_all_bitmaps() |
313 | |
314 | These are called when change bitmap to none. If a node plans |
315 | to clear the cluster raid's bitmap, it need to make sure no other |
316 | nodes are using the raid which is achieved by lock all bitmap |
317 | locks within the cluster, and also those locks are unlocked |
318 | accordingly. |
319 | |
320 | 7. Unsupported features |
321 | |
322 | There are somethings which are not supported by cluster MD yet. |
323 | |
324 | - update size and change array_sectors. |
325 |