From: Thibaut VARENE <varenet@esiee.fr>
Date: Sun 16 Jun 2002 12:50:55 Europe/Paris
To: Ryan Bradetich <rbradetich@uswest.net>
Subject: Hang again :)

Hi!

I woke up this morning and noticed that the A500 had got stuck again.

I have noticed that if I hang the LAN console by issuing a command that blocks (ps -ef, w or so), I can regain control of it with SysRq 'I' (Kill All Tasks). SysRq 'E' (Terminate All Tasks) seems to work fine too.

As you can see by comparing the output of SysRq 't' and the ps, it is the setiathome process (PID 28484) that hung the box. What is weird is that I could ls /proc/28484, but not cat /proc/28484/status (the cat would freeze).

Below are various dumps.

The very first SysRq 't' (after noticing that my 'w' command got stuck):

telnet> send break
SysRq : Show State

                                 free                        sibling
  task                 PC        stack   pid father child younger older
init {{ flush_scheduled_tasks }} S 0000000000abcdef 112 1 0 29298 3 (NOTLB)
keventd {{ do_fork }} S 0000000000abcdef 112 2 1 8 (L-TLB)
ksoftirqd_CPU {{ do_fork }} S 0000000000abcdef 112 3 0 4 1 (L-TLB)
ksoftirqd_CPU {{ do_fork }} S 0000000000abcdef 112 4 0 5 3 (L-TLB)
kswapd {{ do_fork }} S 0000000000abcdef 112 5 0 6 4 (L-TLB)
bdflush {{ do_fork }} S 0000000000abcdef 112 6 0 7 5 (L-TLB)
kupdated {{ __wait_on_buffer }} S 0000000000abcdef 112 7 0 6 (L-TLB)
scsi_eh_0 {{ do_fork }} S 0000000000abcdef 112 8 1 9 2 (L-TLB)
scsi_eh_1 {{ do_fork }} S 0000000000abcdef 112 9 1 89 8 (L-TLB)
portmap {{ __lock_page }} S 0000000000abcdef 112 89 1 137 9 (NOTLB)
syslogd {{ __wait_on_buffer }} S 0000000000abcdef 112 137 1 140 89 (NOTLB)
klogd {{ ___wait_on_page }} R 0000000000abcdef 112 140 1 143 137 (NOTLB)
rpc.statd {{ __lock_page }} S 0000000000abcdef 112 143 1 149 140 (NOTLB)
inetd {{ do_fork }} S 0000000000abcdef 0 149 1 158 143 (NOTLB)
sshd {{ __lock_page }} S 0000000000abcdef 112 158 1 161 149 (NOTLB)
ntpd {{ __wait_on_buffer }} S 0000000000abcdef 112 161 1 164 158 (NOTLB)
rpc.nfsd {{ __lock_page }} S 0000000000abcdef 112 164 1 167 161 (NOTLB)
rpc.mountd {{ do_fork }} S 0000000000abcdef 112 167 1 170 164 (NOTLB)
cron {{ ___wait_on_page }} S 0000000000abcdef 112 170 1 29201 27910 167 (NOTLB)
getty {{ flush_scheduled_tasks }} S 0000000000abcdef 112 27910 1 28489 170 (NOTLB)
setiathome {{ __down_write }} D 0000000000abcdef 112 28484 1 29298 28489 (NOTLB)
dnetc {{ __down_read }} D 0000000000abcdef 0 28489 1 28484 27910 (NOTLB)
cron {{ do_fork }} S 0000000000abcdef 112 29201 170 29202 (NOTLB)
sh {{ do_fork }} S 0000000000abcdef 0 29202 29201 29203 (NOTLB)
run-parts {{ do_fork }} S 0000000000abcdef 0 29203 29202 29205 (NOTLB)
lpr {{ ___wait_on_page }} S 0000000000abcdef 0 29205 29203 29236 (NOTLB)
lpd {{ ___wait_on_page }} S 0000000000abcdef 112 29236 29205 29239 (NOTLB)
lpd {{ do_fork }} S 0000000000abcdef 112 29239 29236 29241 (NOTLB)
start-stop-da {{ __down_read }} D 0000000000abcdef 0 29241 29239 (NOTLB)
w {{ __down_read }} D 0000000000abcdef 112 29298 1 28484 (NOTLB)

A 'ps' issued just after the previous SysRq 't':

mkhppa3:~# ps -efljmw
F   S UID      PID  PPID  PGID   SID C PRI NI ADDR  SZ WCHAN STIME TTY       TIME CMD
100 S root       1     0     0     0 0  68  0 -    381 ?     Jun15 ?     00:00:00 init
040 S root       2     1     1     1 0  69  0 -      0 ?     Jun15 ?     00:00:00 [keventd]
040 S root       3     0     1     1 0  79 19 -      0 ?     Jun15 ?     00:00:00 [ksoftirqd_CPU0]
040 S root       4     0     1     1 0  78 19 -      0 ?     Jun15 ?     00:00:00 [ksoftirqd_CPU1]
040 S root       5     0     1     1 0  69  0 -      0 ?     Jun15 ?     00:00:00 [kswapd]
040 S root       6     0     1     1 0  69  0 -      0 ?     Jun15 ?     00:00:00 [bdflush]
040 S root       7     0     1     1 0  69  0 -      0 ?     Jun15 ?     00:00:01 [kupdated]
040 S root       8     1     1     1 0  69  0 -      0 ?     Jun15 ?     00:00:00 [scsi_eh_0]
040 S root       9     1     1     1 0  69  0 -      0 ?     Jun15 ?     00:00:00 [scsi_eh_1]
140 S daemon    89     1    89    89 0  69  0 -    442 ?     Jun15 ?     00:00:00 /sbin/portmap
040 S root     137     1   137   137 0  69  0 -    681 ?     Jun15 ?     00:00:00 /sbin/syslogd
040 S root     140     1   140   140 0  69  0 -    573 ?     Jun15 ?     00:00:00 /sbin/klogd
140 S root     143     1   143   143 0  69  0 -    479 ?     Jun15 ?     00:00:00 /sbin/rpc.statd
140 S root     149     1   149   149 0  68  0 -    668 ?     Jun15 ?     00:00:00 /usr/sbin/inetd
140 S root     158     1   158   158 0  68  0 -    911 ?     Jun15 ?     00:00:00 /usr/sbin/sshd
140 S root     161     1   161   161 0  69  0 -    732 ?     Jun15 ?     00:00:00 /usr/sbin/ntpd
040 S root     164     1   164   164 0  69  0 -    883 ?     Jun15 ?     00:00:00 /usr/sbin/rpc.nfsd
040 S root     167     1   131   131 0  69  0 -    897 ?     Jun15 ?     00:00:00 /usr/sbin/rpc.mountd
040 S root     170     1   170   170 0  68  0 -    563 ?     Jun15 ?     00:00:00 /usr/sbin/cron
100 S root   27910     1 27910 27910 0  71  0 -    732 ?     Jun15 ttyS0 00:00:00 -bash

The ps got stuck here.

Here is what you get after SysRq 'I', which IIRC kills all non-kernel tasks. As you can see, none of the processes in 'down_read' or 'down_write' state can be killed. The last 'getty' process is normal: it is the serial console coming back to life :)

                                 free                        sibling
  task                 PC        stack   pid father child younger older
init {{ flush_scheduled_tasks }} S 0000000000abcdef 112 1 0 29327 3 (NOTLB)
keventd {{ do_fork }} S 0000000000abcdef 112 2 1 8 (L-TLB)
ksoftirqd_CPU {{ do_fork }} S 0000000000abcdef 112 3 0 4 1 (L-TLB)
ksoftirqd_CPU {{ do_fork }} S 0000000000abcdef 112 4 0 5 3 (L-TLB)
kswapd {{ do_fork }} S 0000000000abcdef 112 5 0 6 4 (L-TLB)
bdflush {{ do_fork }} S 0000000000abcdef 112 6 0 7 5 (L-TLB)
kupdated {{ __wait_on_buffer }} S 0000000000abcdef 112 7 0 6 (L-TLB)
scsi_eh_0 {{ do_fork }} S 0000000000abcdef 112 8 1 9 2 (L-TLB)
scsi_eh_1 {{ do_fork }} S 0000000000abcdef 112 9 1 28489 8 (L-TLB)
setiathome {{ __down_write }} D 0000000000abcdef 112 28484 1 29298 28489 (NOTLB)
dnetc {{ __down_read }} D 0000000000abcdef 0 28489 1 28484 9 (NOTLB)
start-stop-da {{ __down_read }} D 0000000000abcdef 0 29241 1 29311 29298 (NOTLB)
w {{ __down_read }} D 0000000000abcdef 112 29298 1 29241 28484 (NOTLB)
ps {{ __down_read }} D 0000000000abcdef 0 29311 1 29317 29241 (NOTLB)
ps {{ __down_read }} D 0000000000abcdef 112 29317 1 29326 29311 (NOTLB)
cat {{ __down_read }} D 0000000000abcdef 8 29326 1 29327 29317 (NOTLB)
getty {{ do_fork }} S 0000000000abcdef 112 29327 1 29326 (NOTLB)

I also noticed something that might be interesting: when I hit SysRq 'K' (SAK), it complained: "Stack pointer and cr30 do not correspond, dumping..." (IIRC; the output backlog wasn't long enough to see it after the whole dump). At the end of the dump there was this:

Kernel Fault: Code=26 regs=00000000104a5680 (Addr=0000000000000004)

     YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
PSW: 00001000000001000000011100001110 Not tainted
r00-03  0000000000000000 000000001042b430 00000000102325b8 000000001042a430
r04-07  000000001042a430 000000000000006b ffffffffffffffff 0000000000000000
r08-11  0000000000000000 0000000000000008 00000000000001ff 0000000000000100
r12-15  00000000104a4b40 00000000000000fa 00000000000000f0 00000000000000ff
r16-19  00000000104a4b40 00000000f000028c 00000000f0002aa4 0000000000000000
r20-23  00000000103acef0 000000000000494d 00000000103ca704 0000000000000000
r24-27  0000000000000000 00000000104a4b40 0000000000000000 000000001042a430
r28-31  0000000000000004 00000000104a55f0 00000000104a5680 000000001043a430
sr0-3   000000000078aa80 000000000078aa80 0000000000000000 000000000078aa80
sr4-7   0000000000000000 0000000000000000 0000000000000000 0000000000000000

IASQ: 0000000000000000 0000000000000000 IAOQ: 000000001021c7a0 000000001021c7a4
 IIR: 0e601208    ISR: 0000000000000000  IOR: 0000000000000004
 CPU:        0   CR30: 00000000104a4000 CR31: 00000000104a8000
 ORIG_R28: 0000000010230578

From then on the box was completely stuck and no SysRq would work; I had to issue a PC in the GSP to shut it down (it's getting hot again today ;o)

I hope this will be useful,

Thibaut

/me goes back to his exam preparation :^)