General Detection And Deallocation Of Failing Components - IBM Power 780 Technical Overview And Introduction

minimum capacity, the partitions are allowed to start or continue running. If processor capacity is insufficient to run a partition at its minimum value, then starting that partition results in an error condition that must be resolved.

4.2.2 General detection and deallocation of failing components

Runtime correctable or recoverable errors are monitored to determine whether there is a pattern of errors. If a component reaches a predefined error limit, the service processor initiates an action to deconfigure the faulty hardware, helping to avoid a potential system outage and to enhance system availability.
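The threshold behavior described above can be sketched as a per-component error counter. This is only an illustration, not the service processor's actual logic: the class name `ErrorThresholdMonitor`, the component label `core-5`, and the limit of 3 are all hypothetical.

```python
from collections import defaultdict


class ErrorThresholdMonitor:
    """Counts recoverable errors per component and flags components that reach a limit."""

    def __init__(self, error_limit):
        self.error_limit = error_limit        # hypothetical predefined error limit
        self.error_counts = defaultdict(int)  # recoverable errors seen per component
        self.deconfigured = set()             # components removed from use

    def record_recoverable_error(self, component):
        """Record one corrected error; return True if the component is newly deconfigured."""
        if component in self.deconfigured:
            return False
        self.error_counts[component] += 1
        if self.error_counts[component] >= self.error_limit:
            self.deconfigured.add(component)
            return True
        return False


monitor = ErrorThresholdMonitor(error_limit=3)
results = [monitor.record_recoverable_error("core-5") for _ in range(3)]
print(results)               # [False, False, True]
print(monitor.deconfigured)  # {'core-5'}
```

The first two errors are tolerated because they were corrected; only the third, which establishes a pattern under the assumed limit, triggers deconfiguration.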

Persistent deallocation

To enhance system availability, a component that is identified for deallocation or deconfiguration on a POWER processor-based system is flagged for persistent deallocation. Component removal can occur either dynamically (while the system is running) or at boot time (initial program load, or IPL), depending both on the type of fault and when the fault is detected.

In addition, hardware that suffers a runtime unrecoverable fault can be deconfigured from the system after the first occurrence. The system can be rebooted immediately after the failure and resume operation on the remaining stable hardware. This prevents the same faulty hardware from affecting system operation again. The repair action is deferred to a more convenient, less critical time.

Persistent deallocation functions cover the following components:

Processor

L2/L3 cache lines (cache lines are dynamically deleted)

Memory

I/O adapters (failing adapters are deconfigured or bypassed)
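One way to picture persistence is a record of flagged components that survives a reboot and is consulted at the next IPL. The sketch below assumes a JSON file as the record; the file name, the component labels, and the helper functions are hypothetical and do not reflect how the firmware stores this information.

```python
import json
import os
import tempfile


def load_deallocated(path):
    """Return the set of component IDs previously flagged for deallocation."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


def flag_for_deallocation(path, component):
    """Add a component to the persistent record so the flag survives a reboot."""
    flagged = load_deallocated(path)
    flagged.add(component)
    with open(path, "w") as f:
        json.dump(sorted(flagged), f)


def configure_at_boot(path, installed):
    """Return the installed components that may be configured at boot time."""
    flagged = load_deallocated(path)
    return [c for c in installed if c not in flagged]


# Hypothetical inventory and record location, for illustration only.
record_path = os.path.join(tempfile.mkdtemp(), "deallocated.json")
installed = ["core-0", "core-1", "dimm-3", "io-adapter-2"]

flag_for_deallocation(record_path, "core-1")        # fault detected at run time
usable = configure_at_boot(record_path, installed)  # consulted at the next IPL
print(usable)  # ['core-0', 'dimm-3', 'io-adapter-2']
```

Because the record is read on every boot, the faulty component stays out of the configuration until a repair action clears it.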

Processor instruction retry

As in POWER6, the POWER7 and POWER7+ processors support processor instruction retry and alternate processor recovery for a number of core-related faults. This ability significantly reduces exposure to both permanent and intermittent errors in the processor core.

Intermittent errors, often caused by cosmic rays or other sources of radiation, are generally not repeatable.

With this function, when an error is encountered in the core, in caches, or in certain logic functions, the POWER7 or POWER7+ processor first automatically retries the instruction. If the source of the error was truly transient, the instruction succeeds and the system continues as before.

On IBM systems prior to POWER6, this error caused a checkstop.

Alternate processor retry

Hard failures are more difficult to handle because they are permanent errors that recur each time the instruction is repeated. Retrying the instruction does not help in this situation because the instruction continues to fail.

As in POWER6, the POWER7 and POWER7+ processors can extract the failing instruction from the faulty core and retry it elsewhere in the system for a number of faults, after which the failing core is dynamically deconfigured and scheduled for replacement.
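The two-stage recovery described in this section (retry on the same core, then move the instruction to another core) can be sketched in software terms. This is a simplified model, not the hardware mechanism: the `Core` class, the `run_with_recovery` function, and the `max_retries` parameter are hypothetical names chosen for illustration.

```python
class Core:
    """Toy model of a processor core that can fail transiently or permanently."""

    def __init__(self, name, transient_faults=0, hard_fault=False):
        self.name = name
        self.transient_faults = transient_faults  # number of upcoming one-off failures
        self.hard_fault = hard_fault              # True means every attempt fails
        self.configured = True

    def execute(self, instruction):
        if self.hard_fault:
            raise RuntimeError(f"{self.name}: hard fault")
        if self.transient_faults > 0:
            self.transient_faults -= 1
            raise RuntimeError(f"{self.name}: transient fault")
        return instruction()


def run_with_recovery(instruction, core, spare, max_retries=1):
    """Return (result, name of the core that completed the instruction)."""
    # Stage 1: processor instruction retry on the original core.
    for _ in range(1 + max_retries):
        try:
            return core.execute(instruction), core.name
        except RuntimeError:
            continue
    # Stage 2: retries exhausted, so treat the fault as hard. Deconfigure
    # the core and rerun the instruction on another core.
    core.configured = False
    return spare.execute(instruction), spare.name


def add():
    return 2 + 3


flaky = Core("core-0", transient_faults=1)
broken = Core("core-1", hard_fault=True)
spare = Core("core-2")

transient_result = run_with_recovery(add, flaky, spare)
hard_result = run_with_recovery(add, broken, spare)
print(transient_result)   # (5, 'core-0')
print(hard_result)        # (5, 'core-2')
print(broken.configured)  # False
```

The transient fault is absorbed by a single retry on the same core, while the hard fault causes the work to move to the spare core and leaves the failing core deconfigured.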

IBM Power 770 and 780 (9117-MMD, 9179-MHD) Technical Overview and Introduction
