Ethernet card is down

soolan

Hello

OS: Oracle Linux 8.6
My server suddenly lost its network connection; restarting the server fixes the problem. This has happened twice. What could be causing it?


What else would you recommend I check?
 

Do you have free space on your root (/) and /boot directories?
 
How's the RAM usage? Maybe it's filling up.
 
CRITICAL: Broadcom P210tep NetXtreme-E Dual-port 10GBASE-T Ethernet PCIe Adapter Connectivity status changed to Link Failure for adapter in slot
Maybe a hardware issue ... heat being a possible consideration. How's the airflow around that 10G Ethernet card? How's the dust environment?
 
I know when the failure occurred. Can I check the temperature value via iLO?
 
I've never seen an ILO where that was an option.

Can you install lm_sensors?
Then run sensors -f

Most NICs don't report their own temperature, but it will at least
give you a ballpark idea of whether everything else is running hot on that system.
Are most of the other PCI devices running hot?

Of course, it's also possible that the problem is on the remote switch.
Do you have more than one NIC? Can you try another one on that same port?
 
Code:
bnxt_en-pci-1001
Adapter: PCI adapter
temp1:       +174.2°F

coretemp-isa-0001
Adapter: ISA adapter
Package id 1: +138.2°F  (high = +199.4°F, crit = +217.4°F)
Core 0:       +127.4°F  (high = +199.4°F, crit = +217.4°F)
Core 1:       +123.8°F  (high = +199.4°F, crit = +217.4°F)
Core 2:       +122.0°F  (high = +199.4°F, crit = +217.4°F)
Core 3:       +125.6°F  (high = +199.4°F, crit = +217.4°F)
Core 4:       +122.0°F  (high = +199.4°F, crit = +217.4°F)
Core 5:       +120.2°F  (high = +199.4°F, crit = +217.4°F)
Core 6:       +118.4°F  (high = +199.4°F, crit = +217.4°F)
Core 7:       +138.2°F  (high = +199.4°F, crit = +217.4°F)

i350bb-pci-4800
Adapter: PCI adapter
loc1:        +129.2°F  (high = +248.0°F, crit = +230.0°F)

bnxt_en-pci-1000
Adapter: PCI adapter
temp1:       +174.2°F

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +145.4°F  (high = +199.4°F, crit = +217.4°F)
Core 0:       +129.2°F  (high = +199.4°F, crit = +217.4°F)
Core 1:       +125.6°F  (high = +199.4°F, crit = +217.4°F)
Core 2:       +132.8°F  (high = +199.4°F, crit = +217.4°F)
Core 3:       +131.0°F  (high = +199.4°F, crit = +217.4°F)
Core 4:       +127.4°F  (high = +199.4°F, crit = +217.4°F)
Core 5:       +125.6°F  (high = +199.4°F, crit = +217.4°F)
Core 6:       +145.4°F  (high = +199.4°F, crit = +217.4°F)
Core 7:       +145.4°F  (high = +199.4°F, crit = +217.4°F)

power_meter-acpi-0
Adapter: ACPI interface
power1:        0.00 W  (interval = 300.00 s)
Yes, I have another NIC.
 
bnxt_en-pci-1000
Adapter: PCI adapter
temp1:       +174.2°F

That one is running pretty hot. Nothing else looks too bad.
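
To make the comparison above less of an eyeball job, here is a minimal Python sketch that parses `sensors -f` output and ranks the readings. The function names (`parse_sensors_f`, `to_celsius`, `hottest`) are made up for illustration, and the sample text is a shortened copy of the output posted above. It only ranks what lm_sensors prints; the bnxt_en entry reports no high/crit limits, so whether 174.2°F (about 79°C) is within spec for this card has to be checked against the vendor's documentation.

```python
import re

# Matches lines such as "temp1:       +174.2°F" or
# "Core 0:       +127.4°F  (high = +199.4°F, crit = +217.4°F)".
READING = re.compile(r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°F")


def parse_sensors_f(text):
    """Return (chip, label, fahrenheit) tuples from `sensors -f` output."""
    readings = []
    chip = None
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            chip = None  # a blank line ends a chip section
            continue
        if chip is None:
            chip = line  # the first line of a section is the chip name
            continue
        match = READING.match(line)
        if match:
            readings.append((chip, match["label"].strip(), float(match["temp"])))
    return readings


def to_celsius(fahrenheit):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (fahrenheit - 32.0) * 5.0 / 9.0


def hottest(readings, count=3):
    """Return the `count` highest readings, hottest first."""
    return sorted(readings, key=lambda r: r[2], reverse=True)[:count]


# Shortened copy of the output posted in this thread.
SAMPLE = """\
bnxt_en-pci-1000
Adapter: PCI adapter
temp1:       +174.2°F

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +145.4°F  (high = +199.4°F, crit = +217.4°F)
Core 0:       +129.2°F  (high = +199.4°F, crit = +217.4°F)

i350bb-pci-4800
Adapter: PCI adapter
loc1:        +129.2°F  (high = +248.0°F, crit = +230.0°F)
"""

if __name__ == "__main__":
    for chip, label, temp_f in hottest(parse_sensors_f(SAMPLE)):
        print(f"{chip} {label}: {temp_f:.1f}°F = {to_celsius(temp_f):.1f}°C")
```

On a live system the same functions could be fed the real output of `sensors -f` instead of `SAMPLE`.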
 
I encountered this problem again. When I checked the server through iLO, the network card was shown as down. It was resolved after I restarted the server.
Which logs would help us find a solution?
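
Besides the logs, the kernel keeps per-interface link counters in sysfs that can be recorded before and after an outage. The following is a rough Python sketch, assuming the running kernel exposes `carrier_changes`, `carrier_up_count` and `carrier_down_count` under `/sys/class/net/<interface>/` (recent kernels do; older ones may only have `carrier_changes`). The helper name `link_snapshot` is hypothetical, and `ens1f0np0` is the interface name that appears in this server's kernel messages.

```python
from pathlib import Path

# Standard sysfs attributes for a network interface; some may be absent
# on older kernels, and "speed" can be unreadable while the link is down.
ATTRIBUTES = (
    "operstate",
    "speed",
    "carrier_changes",
    "carrier_up_count",
    "carrier_down_count",
)


def link_snapshot(interface, sysfs_root="/sys/class/net"):
    """Read link-state attributes for one interface; missing ones map to None."""
    base = Path(sysfs_root) / interface
    snapshot = {}
    for name in ATTRIBUTES:
        try:
            snapshot[name] = (base / name).read_text().strip()
        except OSError:
            # Attribute not present, or the driver refused the read.
            snapshot[name] = None
    return snapshot


if __name__ == "__main__":
    # Interface name taken from the kernel messages in this thread.
    for key, value in link_snapshot("ens1f0np0").items():
        print(f"{key}: {value}")
```

Running it periodically (for example from cron) and comparing the counters would show whether the link is flapping more often than the outages that were actually noticed.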
 
I constantly see this in /var/log/messages on this server, but I don't see these messages on my other servers.

Code:
 systemd[20382]: Stopped target Timers.
 systemd[20382]: Closed D-Bus User Message Bus Socket.
 systemd[20382]: Stopped target Paths.
 systemd[20382]: Reached target Shutdown.
 systemd[20382]: Started Exit the Session.
 systemd[20382]: Reached target Exit the Session.
 systemd[1]: user@0.service: Succeeded.
 systemd[1]: Stopped User Manager for UID 0.
 systemd[1]: Stopping User runtime directory /run/user/0...
 systemd[1]: run-user-0.mount: Succeeded.
 systemd[1]: user-runtime-dir@0.service: Succeeded.
 systemd[1]: Stopped User runtime directory /run/user/0.
 systemd[1]: Removed slice User Slice of UID 0.
 crond[20288]: postdrop: warning: unable to look up public/pickup: No such file or directory
 systemd[1]: session-202.scope: Succeeded.
 systemd[1]: Stopping User Manager for UID 1008...
 systemd[20229]: Stopping D-Bus User Message Bus...
 systemd[20229]: Stopped target Default.
 systemd[20229]: Stopped D-Bus User Message Bus.
 systemd[20229]: Stopped target Basic System.
 systemd[20229]: Stopped target Timers.
 systemd[20229]: Stopped Mark boot as successful after the user session has run 2 minutes.
 systemd[20229]: Stopped target Paths.
 systemd[20229]: Stopped target Sockets.
 systemd[20229]: Closed Sound System.
 systemd[20229]: Closed Multimedia System.
 systemd[20229]: Closed D-Bus User Message Bus Socket.
 systemd[20229]: Reached target Shutdown.
 systemd[20229]: Started Exit the Session.
 systemd[20229]: Reached target Exit the Session.
 systemd[1]: user@1008.service: Succeeded.
 systemd[1]: Stopped User Manager for UID 1008.
 systemd[1]: Stopping User runtime directory /run/user/1008...
 systemd[1]: run-user-1008.mount: Succeeded.
 systemd[1]: user-runtime-dir@1008.service: Succeeded.
 systemd[1]: Stopped User runtime directory /run/user/1008.
 systemd[1]: Removed slice User Slice of UID 1008.
 
Second, these are the log messages from when my Ethernet card goes down:

Code:
 kernel: bnxt_en 0000:10:00.0 ens1f0np0: NIC Link is Down
 smad[1712]: [INFO  ]: AgentX trap received
 smad[1712]: [NOTICE]: AgentX trap CPQNIC (.1.3.6.1.6.3.1.1.4.1.0:.1.3.6.1.4.1.232.0.18012)
 smad[1712]: [NOTICE]: IML received: 171 bytes
 smad[1712]: [ALERT ]: CRITICAL: Broadcom P210tep NetXtreme-E Dual-port 10GBASE-T Ethernet PCIe Adapter Connectivity status changed to Link Failure for adapter in slot 1, port 1
 smad[1712]: [INFO  ]: Log the IML info to syslog
 NetworkManager[1417]: <info>  [1685482012.4493] device (ens1f0np0): carrier: link connected
 smad[1712]: [INFO  ]: AgentX trap received
 smad[1712]: [NOTICE]: AgentX trap CPQNIC (.1.3.6.1.6.3.1.1.4.1.0:.1.3.6.1.4.1.232.0.18011)
 kernel: bnxt_en 0000:10:00.0 ens1f0np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: ON - receive
 kernel: bnxt_en 0000:10:00.0 ens1f0np0: EEE is not active
 kernel: bnxt_en 0000:10:00.0 ens1f0np0: FEC autoneg off encodings: None
 smad[1712]: [NOTICE]: IML received: 177 bytes
 smad[1712]: [ALERT ]: NOTICE: Broadcom P210tep NetXtreme-E Dual-port 10GBASE-T Ethernet PCIe Adapter Connectivity status changed to OK for adapter in slot 1, port 1 has been repaired
 smad[1712]: [INFO  ]: Log the IML info to syslog
 kernel: bnxt_en 0000:10:00.0 ens1f0np0: NIC Link is Down
 smad[1712]: [INFO  ]: AgentX trap received
 smad[1712]: [NOTICE]: AgentX trap CPQNIC (.1.3.6.1.6.3.1.1.4.1.0:.1.3.6.1.4.1.232.0.18012)
 smad[1712]: [NOTICE]: IML received: 171 bytes
 smad[1712]: [ALERT ]: CRITICAL: Broadcom P210tep NetXtreme-E Dual-port 10GBASE-T Ethernet PCIe Adapter Connectivity status changed to Link Failure for adapter in slot 1, port 1
 smad[1712]: [INFO  ]: Log the IML info to syslog
 smad[1712]: [NOTICE]: IML received: 138 bytes
 smad[1712]: [ALERT ]: CRITICAL:  All links are down in adapter Broadcom P210tep NetXtreme-E Dual-port 10GBASE-T Ethernet PCIe Adapter in slot 1
 smad[1712]: [INFO  ]: Log the IML info to syslog
 NetworkManager[1417]: <info>  [1685482015.9492] device (ens1f0np0): carrier: link connected
 smad[1712]: [INFO  ]: AgentX trap received
 smad[1712]: [NOTICE]: AgentX trap CPQNIC (.1.3.6.1.6.3.1.1.4.1.0:.1.3.6.1.4.1.232.0.18011)
 kernel: bnxt_en 0000:10:00.0 ens1f0np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: ON - receive
 kernel: bnxt_en 0000:10:00.0 ens1f0np0: EEE is not active
 
My suspicions about hardware are aroused by the messages:
Code:
... Link Failure ...
... has been repaired ...
... Link Failure ...

It fits the scenario of a contact failing: say, heat expansion breaks the contact ("Link Failure"), then cooling allows the contact to be made again ("been repaired"), then heat breaks the contact again. Just a theory.
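
One way to test a theory like this against more data is to pull the link transitions out of the full log. The Python sketch below is only an illustration: the regular expression is written against the bnxt_en kernel lines posted above, and the helper names (`link_events`, `summarize`) are invented here. It also reports the negotiated speed, because the posted lines show the 10GBASE-T port coming back up at 1000 Mbps. That may be expected if the switch port is 1 GbE; if the port is supposed to run at 10 Gbps, the cable and switch port are worth checking alongside the card.

```python
import re

# Written against lines such as:
#   kernel: bnxt_en 0000:10:00.0 ens1f0np0: NIC Link is Down
#   kernel: bnxt_en 0000:10:00.0 ens1f0np0: NIC Link is Up, 1000 Mbps full duplex, ...
LINK_EVENT = re.compile(
    r"kernel: bnxt_en \S+ (?P<iface>\S+): NIC Link is (?P<state>Up|Down)"
    r"(?:, (?P<speed>\d+) Mbps)?"
)


def link_events(lines):
    """Return (interface, state, speed_mbps or None) for each link message."""
    events = []
    for line in lines:
        match = LINK_EVENT.search(line)
        if match:
            speed = int(match["speed"]) if match["speed"] else None
            events.append((match["iface"], match["state"], speed))
    return events


def summarize(events):
    """Count Down events and collect the distinct speeds seen on Up events."""
    downs = sum(1 for _, state, _ in events if state == "Down")
    speeds = sorted(
        {speed for _, state, speed in events if state == "Up" and speed is not None}
    )
    return {"down_events": downs, "up_speeds_mbps": speeds}


# Kernel lines copied from the log excerpt posted in this thread.
SAMPLE = [
    " kernel: bnxt_en 0000:10:00.0 ens1f0np0: NIC Link is Down",
    " kernel: bnxt_en 0000:10:00.0 ens1f0np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: ON - receive",
    " kernel: bnxt_en 0000:10:00.0 ens1f0np0: EEE is not active",
    " kernel: bnxt_en 0000:10:00.0 ens1f0np0: NIC Link is Down",
    " kernel: bnxt_en 0000:10:00.0 ens1f0np0: NIC Link is Up, 1000 Mbps full duplex, Flow control: ON - receive",
]

if __name__ == "__main__":
    print(summarize(link_events(SAMPLE)))
```

To run it over the real log, pass the lines of /var/log/messages (for example via `open(...)`) instead of `SAMPLE`.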
 
When I checked, the temperature was normal.
 
Is this your personal server?
 
