Cubietruck NMI watchdog: BUG: soft lockup

Started by jeroen, March 10, 2016, 05:51:53 am


jeroen

March 10, 2016, 05:51:53 am Last Edit: March 10, 2016, 05:54:15 am by jeroen
Hi All,

I am very frequently encountering CPU soft lockups on my cubietruck.
I'm running Fedora 23 and have also installed Samba 4.
This issue has occurred several times over the last four to five months.
Sometimes it happens several times a day and then often nothing for weeks.
It usually happens at times of high network IO during scp, rsync or smb file transfers.
Although the soft lockup report always names smbd as the process that locked up, it does not necessarily happen during smb transfers.
I believe there are two possible causes:
1. A bug in the software.
2. A fault in the hardware.

Unfortunately I do not have a second cubietruck A20 board to rule out a hardware fault.
I'm currently running Fedora 23 with kernel 4.4.3-300.fc23.armv7hl, although I have observed this behaviour on previous kernel builds from Fedora as well.
The setup is a stock build installation using the Fedora-Minimal arm image obtained from here:
http://download.fedoraproject.org/pub/fedora/linux/releases/23/Images/armhfp/
On top of this I've installed the "Fedora Server" component and Samba 4 from the Fedora repository.
A 3TB Western Digital SATA disk is attached to the Cubietruck's SATA connector; I use it to store my file share data.

The issue typically occurs when I transfer large files from another machine to my Cubietruck using smb, scp, or rsync over ssh.
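If it helps, something along these lines is roughly how I could try to trigger it on demand (a rough sketch only; the host and paths are placeholders, not my actual setup):

#!/usr/bin/env python3
# Repeatedly rsync a large file onto the Cubietruck to keep network and SATA
# I/O high while watching for the soft lockup.
# Sketch: SOURCE and DEST are placeholders.
import subprocess
import time

SOURCE = "user@desktop:/tmp/bigfile.bin"   # placeholder remote file
DEST = "/mnt/share/bigfile.bin"            # placeholder path on the SATA disk

while True:
    subprocess.run(["rsync", "--inplace", "--progress", SOURCE, DEST], check=False)
    time.sleep(5)                          # short pause between transfers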
Below is a trace from the message log at the time the issue occurred:

Mar  9 21:14:54 ct-nas kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [smbd:951]
Mar  9 21:14:54 ct-nas kernel: Modules linked in: nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_filter ebtable_nat ebtable_broute bridge stp llc ebtables ip6table_mangle ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_filter ip6_tables iptable_mangle iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw sun4i_codec axp20x_usb_power snd_soc_core snd_pcm_dmaengine axp20x_pek ac97_bus snd_pcm brcmfmac nvmem_sunxi_sid nvmem_core sun4i_ts snd_timer snd soundcore brcmutil cfg80211 sunxi_wdt rfkill sun4i_ss des_generic phy_sun4i_usb leds_gpio cpufreq_dt nfsd axp20x_regulator realtek mmc_block dwmac_sunxi stmmac_platform stmmac ptp pps_core i2c_mv64xxx
Mar  9 21:14:55 ct-nas kernel: rtc_sunxi sunxi_mmc pwm_sun4i ehci_platform ahci_sunxi libahci_platform ohci_platform mmc_core
Mar  9 21:14:55 ct-nas kernel: CPU: 0 PID: 951 Comm: smbd Not tainted 4.4.3-300.fc23.armv7hl #1
Mar  9 21:14:55 ct-nas kernel: Hardware name: Allwinner sun7i (A20) Family
Mar  9 21:14:55 ct-nas kernel: task: ecea5480 ti: e74b6000 task.ti: e74b6000
Mar  9 21:14:55 ct-nas kernel: PC is at _raw_spin_lock+0x34/0x48
Mar  9 21:14:55 ct-nas kernel: LR is at unix_state_double_lock+0x40/0x4c
Mar  9 21:14:55 ct-nas kernel: pc : [<c0918ea4>]    lr : [<c08954a8>]    psr: 80000013#012sp : e74b7ed0  ip : 60000013  fp : becf15f4
Mar  9 21:14:55 ct-nas kernel: r10: e74b7eec  r9 : c0ea2a40  r8 : e74b7f18
Mar  9 21:14:55 ct-nas kernel: r7 : 00000026  r6 : eea69080  r5 : eea692ac  r4 : eb73822c
Mar  9 21:14:55 ct-nas kernel: r3 : 00008599  r2 : 00008594  r1 : 00000000  r0 : eb73822c
Mar  9 21:14:55 ct-nas kernel: Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Mar  9 21:14:55 ct-nas kernel: Control: 10c5387d  Table: 6768406a  DAC: 00000051
Mar  9 21:14:55 ct-nas kernel: CPU: 0 PID: 951 Comm: smbd Not tainted 4.4.3-300.fc23.armv7hl #1
Mar  9 21:14:55 ct-nas kernel: Hardware name: Allwinner sun7i (A20) Family
Mar  9 21:14:55 ct-nas kernel: [<c021928c>] (unwind_backtrace) from [<c0214118>] (show_stack+0x18/0x1c)
Mar  9 21:14:55 ct-nas kernel: [<c0214118>] (show_stack) from [<c0544958>] (dump_stack+0x80/0xa0)
Mar  9 21:14:55 ct-nas kernel: [<c0544958>] (dump_stack) from [<c02f419c>] (watchdog_timer_fn+0x1a8/0x230)
Mar  9 21:14:55 ct-nas kernel: [<c02f419c>] (watchdog_timer_fn) from [<c02b475c>] (__hrtimer_run_queues+0x19c/0x300)
Mar  9 21:14:55 ct-nas kernel: [<c02b475c>] (__hrtimer_run_queues) from [<c02b4f10>] (hrtimer_interrupt+0xb0/0x204)
Mar  9 21:14:55 ct-nas kernel: [<c02b4f10>] (hrtimer_interrupt) from [<c079195c>] (arch_timer_handler_phys+0x30/0x38)
Mar  9 21:14:55 ct-nas kernel: [<c079195c>] (arch_timer_handler_phys) from [<c02a5438>] (handle_percpu_devid_irq+0xc8/0x198)
Mar  9 21:14:55 ct-nas kernel: [<c02a5438>] (handle_percpu_devid_irq) from [<c02a1044>] (generic_handle_irq+0x20/0x30)
Mar  9 21:14:55 ct-nas kernel: [<c02a1044>] (generic_handle_irq) from [<c02a1374>] (__handle_domain_irq+0x94/0xbc)
Mar  9 21:14:55 ct-nas kernel: [<c02a1374>] (__handle_domain_irq) from [<c02096ac>] (gic_handle_irq+0x58/0x80)
Mar  9 21:14:55 ct-nas kernel: [<c02096ac>] (gic_handle_irq) from [<c0919654>] (__irq_svc+0x54/0x70)
Mar  9 21:14:55 ct-nas kernel: Exception stack(0xe74b7e80 to 0xe74b7ec8)
Mar  9 21:14:55 ct-nas kernel: 7e80: eb73822c 00000000 00008594 00008599 eb73822c eea692ac eea69080 00000026
Mar  9 21:14:55 ct-nas kernel: 7ea0: e74b7f18 c0ea2a40 e74b7eec becf15f4 60000013 e74b7ed0 c08954a8 c0918ea4
Mar  9 21:14:55 ct-nas kernel: 7ec0: 80000013 ffffffff
Mar  9 21:14:55 ct-nas kernel: [<c0919654>] (__irq_svc) from [<c0918ea4>] (_raw_spin_lock+0x34/0x48)
Mar  9 21:14:55 ct-nas kernel: [<c0918ea4>] (_raw_spin_lock) from [<c08954a8>] (unix_state_double_lock+0x40/0x4c)
Mar  9 21:14:55 ct-nas kernel: [<c08954a8>] (unix_state_double_lock) from [<c0898ee8>] (unix_dgram_connect+0x8c/0x234)
Mar  9 21:14:55 ct-nas kernel: [<c0898ee8>] (unix_dgram_connect) from [<c07dc4f0>] (SyS_connect+0x84/0xa8)
Mar  9 21:14:55 ct-nas kernel: [<c07dc4f0>] (SyS_connect) from [<c020fc00>] (ret_fast_syscall+0x0/0x3c)
Mar  9 21:14:56 ct-nas abrt-dump-journal-oops: abrt-dump-journal-oops: Found oopses: 1
Mar  9 21:14:56 ct-nas abrt-dump-journal-oops: abrt-dump-journal-oops: Creating problem directories
Mar  9 21:14:57 ct-nas abrt-dump-journal-oops: Reported 1 kernel oopses to Abrt

The lockup has been reported on CPU#0 as well as on CPU#1, but the program counter always seems to be at "_raw_spin_lock+0x34/0x48".
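To check whether the lockups favour one CPU or one task, a small script like the following can tally the soft-lockup lines from the message log (just a sketch; it assumes /var/log/messages and the exact message format shown above):

#!/usr/bin/env python3
# Tally "soft lockup" reports per CPU and per task from the syslog.
# Sketch only: assumes /var/log/messages and the message format shown above.
import re
from collections import Counter

PATTERN = re.compile(
    r"NMI watchdog: BUG: soft lockup - CPU#(\d+) stuck for \d+s! \[([^:]+):\d+\]"
)

cpu_counts = Counter()
task_counts = Counter()

with open("/var/log/messages", errors="replace") as log:
    for line in log:
        match = PATTERN.search(line)
        if match:
            cpu_counts[match.group(1)] += 1
            task_counts[match.group(2)] += 1

print("Lockups per CPU: ", dict(cpu_counts))
print("Lockups per task:", dict(task_counts))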

Has anyone else seen similar issues on the Cubietruck? Does anyone have suggestions for how to rule out a hardware issue without access to a second Cubietruck board?
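One thing I could try as a rough hardware check without a second board is to load both A20 cores for a while with the network idle and see whether lockups still occur; a sketch of that idea (the thermal zone path is an assumption and may not exist with this sunxi kernel):

#!/usr/bin/env python3
# Busy-loop both A20 cores for a fixed time while periodically printing the
# SoC temperature, to see whether pure CPU load (no network/SATA I/O) also
# triggers soft lockups. Sketch only; the thermal zone path is an assumption.
import multiprocessing
import time

DURATION_S = 30 * 60          # run for 30 minutes
THERMAL = "/sys/class/thermal/thermal_zone0/temp"  # may differ on sunxi

def burn(stop_time):
    x = 0.0001
    while time.time() < stop_time:
        x = (x * x + 1.0) % 1e9   # pointless FPU work to keep the core busy

if __name__ == "__main__":
    stop = time.time() + DURATION_S
    workers = [multiprocessing.Process(target=burn, args=(stop,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    while time.time() < stop:
        try:
            with open(THERMAL) as f:
                print("SoC temp (millideg C):", f.read().strip())
        except OSError:
            print("thermal zone not available")
        time.sleep(60)
    for w in workers:
        w.join()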

I would appreciate any input on this before I go down the route of filing a bug report with Fedora.