Starting with a network stack

Posted: Mon Oct 19, 2020 7:31 am
by sunnysideup
I want to learn about computer networks from the bottom up.
I don't really understand the nuances of networking layers. Books mention that layers provide services to the layers above them, but that the interface between layers is not part of the network architecture (network architecture is defined as layers + protocols). What is the difference between a service and an interface? What exactly is a network layer? I can't find clear explanations for these questions anywhere. (I have been reading Computer Networks by A. S. Tanenbaum.)

So, I wanted to see how the various networking layers and protocols actually work. For this, I have decided to write an RTL8139 NIC driver for my kernel.
(I am guessing that the functions performed by the NIC will correspond to the physical layer (mostly), with the framing and error correction functionality of the data link layer also performed by it. Am I right? I will have to write an ethernet driver to complete the functionality of layer 2) --> Is this correct?

I want to emulate/learn the functionality using qemu.
I have read that you need to pass the command line option '-netdev tap,....' along with '-device rtl8139' when invoking the emulator.
* Why is this necessary?
* What would happen if I don't give such a command line option, and instead just add a '-device rtl8139'? I don't mind spending quite a bit of time on this because I want to learn how this really works.

I would appreciate anything on this matter - explanations, advice, ELI5, examples, etc.

Re: Starting with a network stack

Posted: Mon Oct 19, 2020 7:57 am
by rdos
The RTL8139 is old and is not part of modern hardware. If you eventually want to use this for something real, I'd target RTL8169 and compatible network cards instead.

Also, you would need PCI functionality, support for interrupts, and some kind of memory manager to even be able to do something like a network stack.

Re: Starting with a network stack

Posted: Mon Oct 19, 2020 8:12 am
by PeterX
The layers are for getting the network working.
At the highest level there are applications like email-program or web-browser.
At the lowest level is the technical stuff.
The highest level doesn't need to know which technique is used at the lowest level.

The following page explains it a bit (but is complicated) :
https://en.wikipedia.org/wiki/OSI_model

Think of the layers as compatibility: The browser and the webserver must communicate with each other. But what if every browser used a different communication style? Then it wouldn't work. So there is a compatibility: Every web server and every web browser follow the same communication protocol. On lower layers there is another compatibility: The lowest layer (physical) defines bits and bytes and electricity etc. The next layer (from the bottom up) is the data link layer: An example is Ethernet, another example is PPP (internet dialup over phone).

Because this would be too easy, the TCP/IP layers differ slightly from OSI layers. ;) But the principle is the same.
https://en.wikipedia.org/wiki/Internet_protocol_suite

I don't know if all this was helpful. Ask specifically what you want to know, if that was too simple or too complicated.

QEMU's -netdev option defines the host-side network backend (tap, user, ...) and gives it an ID; -device rtl8139 then refers to that ID with netdev=<id>.
https://wiki.qemu.org/Documentation/Net ... backend.3F
sunnysideup wrote: I will have to write an ethernet driver to complete the functionality of layer 2) --> Is this correct?
Yes.

Greetings
Peter

Re: Starting with a network stack

Posted: Mon Oct 19, 2020 8:21 am
by rdos
By using the RTL8139 (or better RTL8169), he has already decided that the data link is Ethernet. There are a few Ethernet protocols that must be implemented before doing TCP/IP, especially ARP, but ICMP and DHCP are also good to have.

Re: Starting with a network stack

Posted: Mon Oct 19, 2020 11:28 am
by sunnysideup
Since I'm trying to learn networking from the bottom up, I'm trying to unlearn and relearn, i.e., I'm moving forward thinking that all that I know about the subject might be incorrect.
Here's what I have understood so far:

* Each node in a communication network must do activities (I am not using the term 'software' here because the activities performed at each node can be implemented via digital logic in hardware)
* These activities are often organized into layers, where the activities in layer N+1 can only use the services provided by the activities in layer N.
* A protocol is a way for activities on different nodes, but at the same layer to communicate. Layer N needn't know what protocol layer N-1 uses to communicate.

This is my basic understanding.
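
To make the three terms concrete, here is a minimal C++ sketch. Every name in it (LowerLayer, LoopbackWire, ToyLayer, the one-byte message type) is made up for illustration and does not come from any book or real stack. The idea it tries to show: the service is "deliver these bytes", the interface is the local function signature used to ask for that service, and the protocol is the header format the two peers agree on.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// The *service* layer N offers to layer N+1: "deliver these bytes to the peer".
// The C++ signature is the *interface*; it is local to this machine and could
// look different on every OS without breaking interoperability.
struct LowerLayer {
    virtual ~LowerLayer() = default;
    virtual void send(const Bytes& sdu) = 0;
};

// Stand-in for the wire: it only records what was handed down.
struct LoopbackWire : LowerLayer {
    Bytes last;
    void send(const Bytes& sdu) override { last = sdu; }
};

// A toy layer N+1. Its *protocol* is the 1-byte header (a hypothetical
// message type) that the peer's layer N+1 must understand.
class ToyLayer {
public:
    explicit ToyLayer(LowerLayer& lower) : lower_(lower) {}

    void send(std::uint8_t msg_type, const Bytes& payload) {
        assert(payload.size() <= 1500);  // arbitrary limit for this sketch
        Bytes pdu;
        pdu.push_back(msg_type);                                // protocol header
        pdu.insert(pdu.end(), payload.begin(), payload.end()); // user data
        lower_.send(pdu);  // uses only the service of the layer below
    }

private:
    LowerLayer& lower_;
};
```

ToyLayer never looks inside LowerLayer, so the wire underneath could be swapped for something else without touching the toy protocol.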
rdos wrote:The RTL8139 is old and is not part of modern hardware. If you eventually want to use this for something real, I'd target RTL8169 and compatible network cards instead.
I was looking at driver implementations, and the RTL8139 driver seemed to be the simplest to implement. I am guessing that the activities performed by the RTL8139 would be the same as other advanced NICs, and it would also provide the same services?
rdos wrote:Also, you would need PCI functionality, support for interrupts, and some kind of memory manager to even be able to do something like a network stack.
Done. done. and done :wink:
PeterX wrote:
sunnysideup wrote: I will have to write an ethernet driver to complete the functionality of layer 2) --> Is this correct?
Yes.
It doesn't seem so... From @rdos's answer, it looks like I have to implement additional protocols (ARP, ICMP, DHCP, etc.) to complete the functionality of layer 2.
rdos wrote:By using the RTL8139 (or better RTL8169), he has already decided that the data link is Ethernet. There are a few Ethernet protocols that must be implemented before doing TCP/IP, especially ARP, but ICMP and DHCP are also good to have.
I read the wiki article.. Would it be better to say that I decided that the MAC sublayer is Ethernet, instead of saying that the data link is Ethernet? Also, I have read that layer N does not care about the protocols used in layer N-1, but you say that TCP/IP (layers 3,4) depend on layer 2 protocols. Why is this?

Re: Starting with a network stack

Posted: Mon Oct 19, 2020 11:54 am
by PeterX
sunnysideup wrote:It doesn't seem so... From @rdos's answer, it looks like I have to implement additional protocols (ARP, ICMP, DHCP, etc.) to complete the functionality of layer 2.
I misunderstood your question as "Is it required to complete". But you meant "Is it enough to complete". So my answer is wrong.

Note that you might code some low level communication using Ethernet but not TCP/IP. I'm thinking of something similar to EtherDFS. This would be only to have some stuff to tinker with, not the ultimate goal. Later you would have to implement higher layers anyway.

Greetings
Peter

Re: Starting with a network stack

Posted: Mon Oct 19, 2020 2:55 pm
by rdos
sunnysideup wrote: I read the wiki article.. Would it be better to say that I decided that the MAC sublayer is Ethernet, instead of saying that the data link is Ethernet? Also, I have read that layer N does not care about the protocols used in layer N-1, but you say that TCP/IP (layers 3,4) depend on layer 2 protocols. Why is this?
Basically because the addressing on Ethernet is based on MAC addresses, and the IP protocol uses another form of addressing. ARP fixes the matching between MAC addresses and IP addresses (but, potentially many other things too). So, you cannot send IP data on Ethernet without ARP, and you also cannot decode IP data without ARP either.
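
As a rough illustration of the IP-to-MAC matching, the core of an ARP implementation is a lookup table. The sketch below is only that table; the class name ArpCache and its two methods are hypothetical, and sending ARP requests, parsing replies, and expiring old entries are all left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>

using MacAddr = std::array<std::uint8_t, 6>;
using Ipv4Addr = std::uint32_t;  // kept in host byte order for simplicity

class ArpCache {
public:
    // Called whenever an ARP packet reveals which MAC owns an IP address.
    void learn(Ipv4Addr ip, const MacAddr& mac) { table_[ip] = mac; }

    // Looks up the MAC address to put in the Ethernet header.
    // Returns false if the address is unknown, in which case a real stack
    // would broadcast an ARP request and queue the packet meanwhile.
    bool resolve(Ipv4Addr ip, MacAddr& out) const {
        std::map<Ipv4Addr, MacAddr>::const_iterator it = table_.find(ip);
        if (it == table_.end())
            return false;
        out = it->second;
        return true;
    }

private:
    std::map<Ipv4Addr, MacAddr> table_;
};
```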

You might not strictly need DHCP, but it's convenient to be able to get a dynamic IP address from your router so you don't need to know which IP address you can use. DHCP also gives you other parameters, like net mask and gateway, that you will need later.

The network NIC basically takes care of the physical implementation of Ethernet, but you need to configure it with buffers and set it up.

If you design your network stack properly, you should be able to run TCP/IP over other link layers than Ethernet, like PPP. You do this by defining an interface that the link layer must implement, which uses IP addresses and IP data.

Also, you should define an interface that the Ethernet NIC driver should implement. Then the ARP protocol can work with many different types of Ethernet NICs, not just the RTL8139. The link layer for Ethernet will then contain your NIC abstraction, the ARP protocol and a set of NIC drivers.
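
The two interfaces could look roughly like this in C++. This is a sketch under assumptions, not a recommended design: EthernetNic, IpLink, FakeNic and EthernetLink are invented names, and the ARP lookup is replaced by a broadcast destination to keep the example short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical interface every Ethernet NIC driver (RTL8139, RTL8169, ...) implements.
struct EthernetNic {
    virtual ~EthernetNic() = default;
    virtual void mac_address(std::uint8_t out[6]) const = 0;
    virtual bool send_frame(const std::uint8_t* frame, std::size_t len) = 0;
};

// Hypothetical interface the IP code sees. Ethernet and PPP would each implement it.
struct IpLink {
    virtual ~IpLink() = default;
    virtual bool send_ip(std::uint32_t next_hop, const std::uint8_t* packet, std::size_t len) = 0;
};

// Fake driver that only records the last frame, so the code runs without hardware.
struct FakeNic : EthernetNic {
    std::vector<std::uint8_t> sent;
    void mac_address(std::uint8_t out[6]) const override {
        const std::uint8_t mac[6] = {0x02, 0x00, 0x00, 0x00, 0x00, 0x01};
        std::copy(mac, mac + 6, out);
    }
    bool send_frame(const std::uint8_t* frame, std::size_t len) override {
        sent.assign(frame, frame + len);
        return true;
    }
};

// Ethernet implementation of IpLink: prepends a 14-byte header and uses any NIC.
class EthernetLink : public IpLink {
public:
    explicit EthernetLink(EthernetNic& nic) : nic_(nic) {}

    bool send_ip(std::uint32_t /*next_hop*/, const std::uint8_t* packet, std::size_t len) override {
        assert(len <= 1500);  // standard Ethernet payload limit
        std::vector<std::uint8_t> frame(14 + len);
        // Placeholder: a real stack would put the ARP result for next_hop here.
        std::fill(frame.begin(), frame.begin() + 6, 0xFF);
        nic_.mac_address(&frame[6]);  // source MAC
        frame[12] = 0x08;             // EtherType 0x0800 = IPv4
        frame[13] = 0x00;
        std::copy(packet, packet + len, frame.begin() + 14);
        return nic_.send_frame(frame.data(), frame.size());
    }

private:
    EthernetNic& nic_;
};
```

The IP code only ever holds an IpLink, so adding a second NIC driver or a PPP link does not change it.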

Re: Starting with a network stack

Posted: Sat Oct 24, 2020 5:46 am
by sunnysideup
I was reading about some functionalities of NICs. Something that seemed to surprise me is the following fact:

In order to send packets to a connected network, the system software (usually the OS, i.e. us) has to manually handcraft a frame (encapsulate a packet into a frame with layer 2 headers). The only functionality that the NIC provides is something like a 'put_frame()' which just 'puts' the frame into the connected network.

However, when receiving packets, the hardware has inbuilt functionality to reject packets which are not addressed to us. That is, there doesn't seem to be a clean abstraction layer provided by the hardware - put_packet() doesn't take the receiver's address as an input parameter, while the receive_packet() does care about the layer 2 address.

This doesn't feel right to me... Perhaps my understanding is incorrect. Anyone willing to shed some light on this?
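
For reference, "handcrafting a frame" amounts to something like the following sketch. The function name build_frame is made up; the layout (6 bytes destination, 6 bytes source, 2 bytes EtherType, payload, padding to 60 bytes) is standard Ethernet II. The trailing CRC is left out on the assumption that the NIC appends it, which many NICs can be configured to do.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using MacAddr = std::array<std::uint8_t, 6>;

// Builds an Ethernet II frame without the trailing CRC.
std::vector<std::uint8_t> build_frame(const MacAddr& dst, const MacAddr& src,
                                      std::uint16_t ethertype,
                                      const std::vector<std::uint8_t>& payload) {
    assert(payload.size() <= 1500);  // standard Ethernet payload limit
    std::vector<std::uint8_t> frame;
    frame.insert(frame.end(), dst.begin(), dst.end());  // who should receive it
    frame.insert(frame.end(), src.begin(), src.end());  // just bytes chosen by software
    frame.push_back(static_cast<std::uint8_t>(ethertype >> 8));    // big-endian EtherType
    frame.push_back(static_cast<std::uint8_t>(ethertype & 0xFF));
    frame.insert(frame.end(), payload.begin(), payload.end());
    if (frame.size() < 60)
        frame.resize(60, 0);  // pad to the 64-byte minimum (60 + 4-byte CRC)
    return frame;
}
```

Note that the destination address is an input here: it is simply part of the buffer that software hands to the NIC.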

Re: Starting with a network stack

Posted: Sat Oct 24, 2020 7:26 am
by rdos
sunnysideup wrote:I was reading about some functionalities of NICs. Something that seemed to surprise me is the following fact:

In order to send packets to a connected network, the system software (usually the OS, i.e. us) has to manually handcraft a frame (encapsulate a packet into a frame with layer 2 headers). The only functionality that the NIC provides is something like a 'put_frame()' which just 'puts' the frame into the connected network.

However, when receiving packets, the hardware has inbuilt functionality to reject packets which are not addressed to us. That is, there doesn't seem to be a clean abstraction layer provided by the hardware - put_packet() doesn't take the receiver's address as an input parameter, while the receive_packet() does care about the layer 2 address.

This doesn't feel right to me... Perhaps my understanding is incorrect. Anyone willing to shed some light on this?
I don't know what layer 2 is, but the filtering happens on MAC addresses, not IP addresses. Although I think this is legacy from when Ethernet actually was implemented with a coax cable and had all computers connected to a single cable. Today we have switches that do the filtering, and so I don't think the NIC actually needs to filter. There are also multicast addresses and broadcast addresses.
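
Whether the check runs in the NIC or in software, the accept/reject decision on the destination MAC looks roughly like this. It is a simplified sketch with hypothetical names (classify, accept_frame); in particular it rejects all multicast, where a real filter would consult a multicast table.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using MacAddr = std::array<std::uint8_t, 6>;

enum class DestKind { Unicast, Multicast, Broadcast };

DestKind classify(const MacAddr& dst) {
    bool all_ones = true;
    for (std::uint8_t b : dst)
        if (b != 0xFF)
            all_ones = false;
    if (all_ones)
        return DestKind::Broadcast;  // FF:FF:FF:FF:FF:FF
    // The least significant bit of the first octet marks a group (multicast) address.
    return (dst[0] & 0x01) ? DestKind::Multicast : DestKind::Unicast;
}

// Decides whether a received frame is passed up the stack.
bool accept_frame(const MacAddr& dst, const MacAddr& own, bool promiscuous) {
    if (promiscuous)
        return true;  // filtering switched off: everything is accepted
    switch (classify(dst)) {
    case DestKind::Broadcast: return true;
    case DestKind::Multicast: return false;  // simplification: no multicast table here
    case DestKind::Unicast:   return dst == own;
    }
    return false;
}
```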

Re: Starting with a network stack

Posted: Sat Oct 24, 2020 10:37 am
by sunnysideup
rdos wrote:I don't know what layer 2 is, but the filtering happens on MAC addresses, not IP addresses. Although I think this is legacy from when Ethernet actually was implemented with a coax cable and had all computers connected to a single cable. Today we have switches that do the filtering, and so I don't think the NIC actually needs to filter. There are also multicast addresses and broadcast addresses.
So if I understand correctly, modern NICs (WiFi, Ethernet, etc - The most common ones afaik) do not have MAC address filtering functionality built into the hardware (When I was reading about the RTL8139, I did find filtering functionality)? This would imply that the system software (us) would have to manually check every single frame that arrives in the NIC (Not all frames in a network would arrive to a particular NIC of course - switches will filter some) for the destination address.

* Is this correct?
* If so, I can only think of 2 services that a NIC must provide - Putting out frames into the network & Receiving frames from the network. If my understanding is correct, why are NIC drivers so complex for modern NICs? What other services do they provide? What do you require your NIC to do apart from putting any frame to the interface and receiving any frame from the interface?
* Can you send a frame from a NIC with the sender's MAC address in the frame different from the NIC's MAC address? I think we can do so because the system software manufactures the frame, and software can be made to do whatever it wants.

Re: Starting with a network stack

Posted: Sat Oct 24, 2020 12:10 pm
by bellezzasolo
sunnysideup wrote:
rdos wrote:I don't know what layer 2 is, but the filtering happens on MAC addresses, not IP addresses. Although I think this is legacy from when Ethernet actually was implemented with a coax cable and had all computers connected to a single cable. Today we have switches that do the filtering, and so I don't think the NIC actually needs to filter. There are also multicast addresses and broadcast addresses.
So if I understand correctly, modern NICs (WiFi, Ethernet, etc - The most common ones afaik) do not have MAC address filtering functionality built into the hardware (When I was reading about the RTL8139, I did find filtering functionality)? This would imply that the system software (us) would have to manually check every single frame that arrives in the NIC (Not all frames in a network would arrive to a particular NIC of course - switches will filter some) for the destination address.

* Is this correct?
* If so, I can only think of 2 services that a NIC must provide - Putting out frames into the network & Receiving frames from the network. If my understanding is correct, why are NIC drivers so complex for modern NICs? What other services do they provide? What do you require your NIC to do apart from putting any frame to the interface and receiving any frame from the interface?
* Can you send a frame from a NIC with the sender's MAC address in the frame different from the NIC's MAC address? I think we can do so because the system software manufactures the frame, and software can be made to do whatever it wants.
They don't have to be particularly complex. Here's a functional Intel Gigabit Ethernet driver:

Code:

#include <chaikrnl.h>
#include <pciexpress.h>
#include <kstdio.h>
#include <string.h>
#include <endian.h>

#include <lwip/netifapi.h>
#include <lwip/etharp.h>
#include <lwip/ethip6.h>

#define USE_DHCP 0

/* Registers */
typedef enum
{
	E1000_REG_CTRL = 0x00000, /* Device Control - RW */
	E1000_REG_STATUS = 0x00008, /* Device Status - RO */
	E1000_REG_EECD = 0x00010, /* EEPROM/Flash Control - RW */
	E1000_REG_EERD = 0x00014, /* EEPROM Read - RW */
	E1000_REG_CTRL_EXT = 0x00018, /* Extended Device Control - RW */
	E1000_REG_FLA = 0x0001C, /* Flash Access - RW */
	E1000_REG_MDIC = 0x00020, /* MDI Control - RW */
	E1000_REG_SCTL = 0x00024, /* SerDes Control - RW */
	E1000_REG_FCAL = 0x00028, /* Flow Control Address Low - RW */
	E1000_REG_FCAH = 0x0002C, /* Flow Control Address High -RW */
	E1000_REG_FEXT = 0x0002C, /* Future Extended - RW */
	E1000_REG_FEXTNVM = 0x00028, /* Future Extended NVM - RW */
	E1000_REG_FEXTNVM3 = 0x0003C, /* Future Extended NVM 3 - RW */
	E1000_REG_FEXTNVM4 = 0x00024, /* Future Extended NVM 4 - RW */
	E1000_REG_FEXTNVM6 = 0x00010, /* Future Extended NVM 6 - RW */
	E1000_REG_FEXTNVM7 = 0x000E4, /* Future Extended NVM 7 - RW */
	E1000_REG_FEXTNVM9 = 0x5BB4, /* Future Extended NVM 9 - RW */
	E1000_REG_FEXTNVM11 = 0x5BBC, /* Future Extended NVM 11 - RW */
	E1000_REG_PCIEANACFG = 0x00F18, /* PCIE Analog Config */
	E1000_REG_FCT = 0x00030, /* Flow Control Type - RW */
	E1000_REG_VET = 0x00038, /* VLAN Ether Type - RW */
	E1000_REG_ICR = 0x000C0, /* Interrupt Cause Read - R/clr */
	E1000_REG_ITR = 0x000C4, /* Interrupt Throttling Rate - RW */
	E1000_REG_ICS = 0x000C8, /* Interrupt Cause Set - WO */
	E1000_REG_IMS = 0x000D0, /* Interrupt Mask Set - RW */
	E1000_REG_IMC = 0x000D8, /* Interrupt Mask Clear - WO */
	E1000_REG_IAM = 0x000E0, /* Interrupt Acknowledge Auto Mask */
	E1000_REG_IVAR = 0x000E4, /* Interrupt Vector Allocation Register - RW */
	E1000_REG_SVCR = 0x000F0,
	E1000_REG_SVT = 0x000F4,
	E1000_REG_LPIC = 0x000FC, /* Low Power IDLE control */
	E1000_REG_RCTL = 0x00100, /* Rx Control - RW */
	E1000_REG_FCTTV = 0x00170, /* Flow Control Transmit Timer Value - RW */
	E1000_REG_TXCW = 0x00178, /* Tx Configuration Word - RW */
	E1000_REG_RXCW = 0x00180, /* Rx Configuration Word - RO */
	E1000_REG_PBA_ECC = 0x01100, /* PBA ECC Register */
	E1000_REG_TCTL = 0x00400, /* Tx Control - RW */
	E1000_REG_TCTL_EXT = 0x00404, /* Extended Tx Control - RW */
	E1000_REG_TIPG = 0x00410, /* Tx Inter-packet gap -RW */
	E1000_REG_AIT = 0x00458, /* Adaptive Interframe Spacing Throttle - RW */
	E1000_REG_LEDCTL = 0x00E00, /* LED Control - RW */
	E1000_REG_LEDMUX = 0x08130, /* LED MUX Control */
	E1000_REG_EXTCNF_CTRL = 0x00F00, /* Extended Configuration Control */
	E1000_REG_EXTCNF_SIZE = 0x00F08, /* Extended Configuration Size */
	E1000_REG_PHY_CTRL = 0x00F10, /* PHY Control Register in CSR */
	E1000_REG_PBA = 0x01000, /* Packet Buffer Allocation - RW */
	E1000_REG_PBS = 0x01008, /* Packet Buffer Size */
	E1000_REG_PBECCSTS = 0x0100C, /* Packet Buffer ECC Status - RW */
	E1000_REG_IOSFPC = 0x00F28, /* TX corrupted data  */
	E1000_REG_EEMNGCTL = 0x01010, /* MNG EEprom Control */
	E1000_REG_EEWR = 0x0102C, /* EEPROM Write Register - RW */
	E1000_REG_FLOP = 0x0103C, /* FLASH Opcode Register */
	E1000_REG_ERT = 0x02008, /* Early Rx Threshold - RW */
	E1000_REG_FCRTL = 0x02160, /* Flow Control Receive Threshold Low - RW */
	E1000_REG_FCRTH = 0x02168, /* Flow Control Receive Threshold High - RW */
	E1000_REG_PSRCTL = 0x02170, /* Packet Split Receive Control - RW */
	E1000_REG_RDFH = 0x02410, /* Rx Data FIFO Head - RW */
	E1000_REG_RDFT = 0x02418, /* Rx Data FIFO Tail - RW */
	E1000_REG_RDFHS = 0x02420, /* Rx Data FIFO Head Saved - RW */
	E1000_REG_RDFTS = 0x02428, /* Rx Data FIFO Tail Saved - RW */
	E1000_REG_RDFPC = 0x02430, /* Rx Data FIFO Packet Count - RW */
	E1000_REG_RDTR = 0x02820, /* Rx Delay Timer - RW */
	E1000_REG_RXDCTL = 0x02828,	/* Receive Descriptor Control - RW */
	E1000_REG_RADV = 0x0282C, /* Rx Interrupt Absolute Delay Timer - RW */

	E1000_REG_RAL = 0x05400, // Receive Address Low
	E1000_REG_RAH = 0x05404, // Receive Address High
	E1000_REG_RDBAL = 0x02800, // RX Descriptor Base Address Low
	E1000_REG_RDBAH = 0x02804, // RX Descriptor Base Address High
	E1000_REG_RDLEN = 0x02808, // RX Descriptor Length
	E1000_REG_RDH = 0x02810, // RX Descriptor Head
	E1000_REG_RDT = 0x02818, // RX Descriptor Tail
	E1000_REG_TDBAL = 0x03800, // TX Descriptor Base Address Low
	E1000_REG_TDBAH = 0x03804, // TX Descriptor Base Address High
	E1000_REG_TDLEN = 0x03808, // TX Descriptor Length
	E1000_REG_TDH = 0x03810, // TX Descriptor Head
	E1000_REG_TDT = 0x03818, // TX Descriptor Tail

	E1000_REG_MTA = 0x05200
} e1000_register_t;

/* Device Control */
typedef enum
{
	E1000_CTRL_FD = 0x00000001, /* Full duplex.0=half; 1=full */
	E1000_CTRL_GIO_MASTER_DISABLE = 0x00000004, /* Blocks new Master reqs */
	E1000_CTRL_LRST = 0x00000008, /* Link reset. 0=normal,1=reset */
	E1000_CTRL_ASDE = 0x00000020, /* Auto-speed detect enable */
	E1000_CTRL_SLU = 0x00000040, /* Set link up (Force Link) */
	E1000_CTRL_ILOS = 0x00000080, /* Invert Loss-Of Signal */
	E1000_CTRL_SPD_SEL = 0x00000300, /* Speed Select Mask */
	E1000_CTRL_SPD_10 = 0x00000000, /* Force 10Mb */
	E1000_CTRL_SPD_100 = 0x00000100, /* Force 100Mb */
	E1000_CTRL_SPD_1000 = 0x00000200, /* Force 1Gb */
	E1000_CTRL_FRCSPD = 0x00000800, /* Force Speed */
	E1000_CTRL_FRCDPX = 0x00001000, /* Force Duplex */
	E1000_CTRL_LANPHYPC_OVERRIDE = 0x00010000, /* SW control of LANPHYPC */
	E1000_CTRL_LANPHYPC_VALUE = 0x00020000, /* SW value of LANPHYPC */
	E1000_CTRL_MEHE = 0x00080000, /* Memory Error Handling Enable */
	E1000_CTRL_SWDPIN0 = 0x00040000, /* SWDPIN 0 value */
	E1000_CTRL_SWDPIN1 = 0x00080000, /* SWDPIN 1 value */
	E1000_CTRL_ADVD3WUC = 0x00100000, /* D3 WUC */
	E1000_CTRL_EN_PHY_PWR_MGMT = 0x00200000, /* PHY PM enable */
	E1000_CTRL_SWDPIO0 = 0x00400000, /* SWDPIN 0 Input or output */
	E1000_CTRL_RST = 0x04000000, /* Global reset */
	E1000_CTRL_RFCE = 0x08000000, /* Receive Flow Control enable */
	E1000_CTRL_TFCE = 0x10000000, /* Transmit flow control enable */
	E1000_CTRL_VME = 0x40000000, /* IEEE VLAN mode enable */
	E1000_CTRL_PHY_RST = 0x80000000, /* PHY Reset */
} e1000_ctrl_flags_t;

/* Extended Device Control */
typedef enum
{
	E1000_CTRL_EXT_LPCD = 0x00000004, /* LCD Power Cycle Done */
	E1000_CTRL_EXT_SDP3_DATA = 0x00000080, /* SW Definable Pin 3 data */
	E1000_CTRL_EXT_FORCE_SMBUS = 0x00000800, /* Force SMBus mode */
	E1000_CTRL_EXT_EE_RST = 0x00002000, /* Reinitialize from EEPROM */
	E1000_CTRL_EXT_SPD_BYPS = 0x00008000, /* Speed Select Bypass */
	E1000_CTRL_EXT_RO_DIS = 0x00020000, /* Relaxed Ordering disable */
	E1000_CTRL_EXT_DMA_DYN_CLK_EN = 0x00080000, /* DMA Dynamic Clk Gating */
	E1000_CTRL_EXT_LINK_MODE_MASK = 0x00C00000,
	E1000_CTRL_EXT_LINK_MODE_PCIE_SERDES = 0x00C00000,
	E1000_CTRL_EXT_EIAME = 0x01000000,
	E1000_CTRL_EXT_DRV_LOAD = 0x10000000, /* Drv loaded bit for FW */
	E1000_CTRL_EXT_IAME = 0x08000000, /* Int ACK Auto-mask */
	E1000_CTRL_EXT_PBA_CLR = 0x80000000, /* PBA Clear */
	E1000_CTRL_EXT_LSECCK = 0x00001000,
	E1000_CTRL_EXT_PHYPDEN = 0x00100000,
} e1000_ctrl_ext_flags_t;

/* Device Status */
typedef enum
{
	E1000_STATUS_FD = 0x00000001, /* Duplex 0=half 1=full */
	E1000_STATUS_LU = 0x00000002, /* Link up.0=no,1=link */
	E1000_STATUS_FUNC_MASK = 0x0000000C, /* PCI Function Mask */
	E1000_STATUS_FUNC_SHIFT = 2,
	E1000_STATUS_FUNC_1 = 0x00000004, /* Function 1 */
	E1000_STATUS_TXOFF = 0x00000010, /* transmission paused */
	E1000_STATUS_SPEED_MASK = 0x000000C0,
	E1000_STATUS_SPEED_10 = 0x00000000, /* Speed 10Mb/s */
	E1000_STATUS_SPEED_100 = 0x00000040, /* Speed 100Mb/s */
	E1000_STATUS_SPEED_1000 = 0x00000080, /* Speed 1000Mb/s */
	E1000_STATUS_LAN_INIT_DONE = 0x00000200, /* Lan Init Compltn by NVM */
	E1000_STATUS_PHYRA = 0x00000400, /* PHY Reset Asserted */
	E1000_STATUS_GIO_MASTER_ENABLE = 0x00080000, /* Master request status */
	E1000_STATUS_2P5_SKU = 0x00001000, /* Val of 2.5GBE SKU strap */
	E1000_STATUS_2P5_SKU_OVER = 0x00002000, /* Val of 2.5GBE SKU Over */
} e1000_device_status_flags_t;

/* Receive Control */
typedef enum
{
	E1000_RCTL_EN = 0x00000002, /* enable */
	E1000_RCTL_SBP = 0x00000004, /* store bad packet */
	E1000_RCTL_UPE = 0x00000008, /* unicast promisc enable */
	E1000_RCTL_MPE = 0x00000010, /* multicast promisc enable */
	E1000_RCTL_LPE = 0x00000020, /* long packet enable */
	E1000_RCTL_LBM_NO = 0x00000000, /* no loopback mode */
	E1000_RCTL_LBM_MAC = 0x00000040, /* MAC loopback mode */
	E1000_RCTL_LBM_TCVR = 0x000000C0, /* tcvr loopback mode */
	E1000_RCTL_DTYP_PS = 0x00000400, /* Packet Split descriptor */
	E1000_RCTL_RDMTS_HALF = 0x00000000, /* Rx desc min thresh size */
	E1000_RCTL_RDMTS_HEX = 0x00010000,
	E1000_RCTL_RDMTS1_HEX = E1000_RCTL_RDMTS_HEX,
	E1000_RCTL_MO_SHIFT = 12, /* multicast offset shift */
	E1000_RCTL_MO_3 = 0x00003000, /* multicast offset 15:4 */
	E1000_RCTL_BAM = 0x00008000, /* broadcast enable */
	/* these buffer sizes are valid if E1000_RCTL_BSEX is 0 */
	E1000_RCTL_SZ_2048 = 0x00000000, /* Rx buffer size 2048 */
	E1000_RCTL_SZ_1024 = 0x00010000, /* Rx buffer size 1024 */
	E1000_RCTL_SZ_512 = 0x00020000, /* Rx buffer size 512 */
	E1000_RCTL_SZ_256 = 0x00030000, /* Rx buffer size 256 */
	/* these buffer sizes are valid if E1000_RCTL_BSEX is 1 */
	E1000_RCTL_SZ_16384 = 0x00010000, /* Rx buffer size 16384 */
	E1000_RCTL_SZ_8192 = 0x00020000, /* Rx buffer size 8192 */
	E1000_RCTL_SZ_4096 = 0x00030000, /* Rx buffer size 4096 */
	E1000_RCTL_VFE = 0x00040000, /* vlan filter enable */
	E1000_RCTL_CFIEN = 0x00080000, /* canonical form enable */
	E1000_RCTL_CFI = 0x00100000, /* canonical form indicator */
	E1000_RCTL_DPF = 0x00400000, /* discard pause frames */
	E1000_RCTL_PMCF = 0x00800000, /* pass MAC control frames */
	E1000_RCTL_BSEX = 0x02000000, /* Buffer size extension */
	E1000_RCTL_SECRC = 0x04000000, /* Strip Ethernet CRC */
} e1000_rx_ctrl_flags_t;

/* Transmit Control */
typedef enum
{
	E1000_TCTL_EN = 0x00000002, /* enable Tx */
	E1000_TCTL_PSP = 0x00000008, /* pad short packets */
	E1000_TCTL_CT = 0x00000ff0, /* collision threshold */
	E1000_TCTL_COLD = 0x003ff000, /* collision distance */
	E1000_TCTL_RTLC = 0x01000000, /* Re-transmit on late collision */
	E1000_TCTL_MULR = 0x10000000, /* Multiple request support */
} e1000_tx_ctrl_flags_t;

/* Interrupt Cause Read */
typedef enum
{
	E1000_ICR_TXDW = 0x00000001, /* Transmit desc written back */
	E1000_ICR_LSC = 0x00000004, /* Link Status Change */
	E1000_ICR_RXSEQ = 0x00000008, /* Rx sequence error */
	E1000_ICR_RXDMT0 = 0x00000010, /* Rx desc min. threshold (0) */
	E1000_ICR_RXT0 = 0x00000080, /* Rx timer intr (ring 0) */
	E1000_ICR_ECCER = 0x00400000, /* Uncorrectable ECC Error */
	E1000_ICR_INT_ASSERTED = 0x80000000, /* If this bit asserted, the driver should claim the interrupt */
	E1000_ICR_RXQ0 = 0x00100000, /* Rx Queue 0 Interrupt */
	E1000_ICR_RXQ1 = 0x00200000, /* Rx Queue 1 Interrupt */
	E1000_ICR_TXQ0 = 0x00400000, /* Tx Queue 0 Interrupt */
	E1000_ICR_TXQ1 = 0x00800000, /* Tx Queue 1 Interrupt */
	E1000_ICR_OTHER = 0x01000000, /* Other Interrupts */
} e1000_intr_cause_flags_t;

/* Interrupt Mask Set */
typedef enum
{
	E1000_IMS_TXDW = E1000_ICR_TXDW, /* Tx desc written back */
	E1000_IMS_LSC = E1000_ICR_LSC, /* Link Status Change */
	E1000_IMS_RXSEQ = E1000_ICR_RXSEQ, /* Rx sequence error */
	E1000_IMS_RXDMT0 = E1000_ICR_RXDMT0, /* Rx desc min. threshold */
	E1000_IMS_RXT0 = E1000_ICR_RXT0, /* Rx timer intr */
	E1000_IMS_ECCER = E1000_ICR_ECCER, /* Uncorrectable ECC Error */
	E1000_IMS_RXQ0 = E1000_ICR_RXQ0, /* Rx Queue 0 Interrupt */
	E1000_IMS_RXQ1 = E1000_ICR_RXQ1, /* Rx Queue 1 Interrupt */
	E1000_IMS_TXQ0 = E1000_ICR_TXQ0, /* Tx Queue 0 Interrupt */
	E1000_IMS_TXQ1 = E1000_ICR_TXQ1, /* Tx Queue 1 Interrupt */
	E1000_IMS_OTHER = E1000_ICR_OTHER, /* Other Interrupt */
} e1000_intr_mask_flags_t;

// Structure of transmit descriptors.
#pragma pack(push, 1)
typedef struct
{
	// Physical address of the transmit descriptor packet buffer.
	uint64_t address;

	// Length of the packet part to transmit.
	uint16_t length;

	// Check sum offset.
	uint8_t cso;

	// Command field.
	uint8_t command;

	// Packet transmission status.
	uint8_t status;

	// Checksum start field.
	uint8_t css;

	// Unused.
	uint16_t special;
} tx_desc_t;

typedef struct {
	volatile uint64_t addr;
	volatile uint16_t length;
	volatile uint16_t checksum;
	volatile uint8_t status;
	volatile uint8_t errors;
	volatile uint16_t special;
}rx_desc_t;
#pragma pack(pop)

pci_device_declaration i219_devs[] =
{
	{0x8086, 0x10D3, PCI_CLASS_ANY},
	{0x8086, 0x15BC, PCI_CLASS_ANY},
	PCI_DEVICE_END
};

paddr_t pmmngr_allloc_contig(size_t numpages)
{
	if (numpages == 0)
		return 0;
	paddr_t phy_addr = pmmngr_allocate(1);
	for (int i = 1; i < numpages; ++i)
	{
		paddr_t alloc = pmmngr_allocate(1);
		if (alloc != phy_addr + i * PAGESIZE)
		{
			kprintf(u"Contiguity failure: %x -> %x, iteration %d\n", phy_addr, alloc, i);
			pmmngr_free(alloc, 1);
			pmmngr_free(phy_addr, i);
			return 0;
		}
	}
	return phy_addr;
}

static bool i219_allocate_hardware_buffer(void** mapped_address, paddr_t& phy_addr, size_t objectlength, size_t count)
{
	size_t numpages = DIV_ROUND_UP(objectlength * count, PAGESIZE);
	phy_addr = pmmngr_allloc_contig(numpages);
	if (!phy_addr)
		return false;
	*mapped_address = find_free_paging(numpages * PAGESIZE);
	if (!paging_map(*mapped_address, phy_addr, numpages * PAGESIZE, PAGE_ATTRIBUTE_NO_CACHING | PAGE_ATTRIBUTE_WRITABLE))
	{
		// Don't leak the physical pages if the mapping fails
		pmmngr_free(phy_addr, numpages);
		return false;
	}
	return true;
}

class I219Registers {
public:
	I219Registers(void* mappedio)
		:mappedmem(mappedio)
	{

	}

	uint32_t read(e1000_register_t reg)
	{
		return *raw_offset<volatile uint32_t*>(mappedmem, reg);
	}

	void write(e1000_register_t reg, uint32_t value)
	{
		*raw_offset<volatile uint32_t*>(mappedmem, reg) = value;
	}

private:
	const void* mappedmem;
};

struct i219_driver_info {
	void* mapped_controller;
	pci_address address;
	netif* netif;
	tx_desc_t* maptxdescs;
	void* maptxbuf;
	size_t TX_BUFFER_SIZE;
	size_t TX_DESC_COUNT;
	size_t txTail = 0;
	rx_desc_t* maprxdescs;
	void* maprxbuf;
	paddr_t rx_buf_phy;
	size_t RX_BUFFER_SIZE;
	size_t RX_DESC_COUNT;
	uint32_t rxCur = 0;
};

static uint8_t ethernet_interrupt(size_t vector, void* param)
{
	i219_driver_info* dinfo = (i219_driver_info*)param;
	I219Registers devregs(dinfo->mapped_controller);
	uint32_t status = devregs.read(E1000_REG_ICR);
	if (status & E1000_ICR_LSC)
	{
		uint32_t devstat = devregs.read(E1000_REG_STATUS);
		bool up = (devstat & E1000_STATUS_LU) != 0;	// Link Up bit of the status register
		//kprintf(u"Link status interrupt fired: active %d\n", up);
		if (up)
		{
			netifapi_netif_set_link_up_async(dinfo->netif);
		}
		else
		{
			netifapi_netif_set_link_down_async(dinfo->netif);
		}
	}
	if (status & E1000_ICR_RXT0)
	{
		//kprintf(u"Packet interrupt: RX %d\n", dinfo->rxCur);
		while (dinfo->maprxdescs[dinfo->rxCur].status != 0)
		{
			uint8_t *buf = raw_offset<uint8_t*>(dinfo->maprxbuf, dinfo->maprxdescs[dinfo->rxCur].addr - dinfo->rx_buf_phy);
			uint16_t len = dinfo->maprxdescs[dinfo->rxCur].length;

			uint16_be* type = raw_offset<uint16_be*>(buf, 12);
			uint16_t tp = BE_TO_CPU16((*type));
			if (tp == 0x88A8)
				type = raw_offset<uint16_be*>(type, 8);
			else if(tp == 0x8100)
				type = raw_offset<uint16_be*>(type, 4);
			tp = BE_TO_CPU16((*type));
			//kprintf(u" Received a packet: type %x, length %d, status %d\n", tp, len, dinfo->maprxdescs[dinfo->rxCur].status);

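			// Here you should inject the received packet into your network stack
			pbuf* p = pbuf_alloc(PBUF_RAW, len, PBUF_POOL);
			if (p != nullptr)
			{
				// PBUF_POOL buffers may be chained, so copy with pbuf_take() instead of one memcpy
				pbuf_take(p, buf, len);		//We release the device buffer
				p->if_idx = dinfo->netif->num;
				if (dinfo->netif->input(p, dinfo->netif) != ERR_OK)
					pbuf_free(p);		//The stack did not take ownership
			}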


			dinfo->maprxdescs[dinfo->rxCur].status = 0;
			auto old_cur = dinfo->rxCur;
			dinfo->rxCur = (dinfo->rxCur + 1) % dinfo->RX_DESC_COUNT;
			devregs.write(E1000_REG_RDT, old_cur);		//TODO: this is probably wrong
		}
	}
#if 0
	else
	{
		kprintf(u"Unknown ethernet interrupt: status %x\n", status);
	}
#endif
	devregs.write(E1000_REG_ICR, status);		//Interrupt handled
	return 1;
}

#pragma pack(push, 1)
struct arp_packet {
	uint16_be hw_type;
	uint16_be prot_type;
	uint8_t hw_addrlen;
	uint8_t pr_addrlen;
	uint16_be operation;
	uint8_t sender_hw_address[6];
	uint8_t sender_protaddr[4];
	uint8_t target_hw_address[6];
	uint8_t target_protaddr[4];
};
static_assert(sizeof(arp_packet) == 28, "bad ARP packet size");

template <size_t n> struct ethernet_frame {
	uint8_t mac_destination[6];
	uint8_t mac_source[6];
	uint16_be size_type;
	uint8_t payload[n];
	uint32_be crc;
};
#pragma pack(pop)

err_t i219_tx(struct netif *netif, struct pbuf *p)
{
	uint16_be* type = raw_offset<uint16_be*>(p->payload, 12);
	uint16_t tp = BE_TO_CPU16((*type));
	if (tp == 0x88A8)
		type = raw_offset<uint16_be*>(type, 8);
	else if (tp == 0x8100)
		type = raw_offset<uint16_be*>(type, 4);
	tp = BE_TO_CPU16((*type));
	auto dinfo = (i219_driver_info*)netif->state;
	void* mapped_controller = dinfo->mapped_controller;
	I219Registers devregs(mapped_controller);

	size_t txTail = dinfo->txTail;
	size_t next = txTail + 1;
	if (next == dinfo->TX_DESC_COUNT)
		next = 0;

	//kprintf(u"Network stack tx: length %d, type %x, slot %d[n:%d]\n", p->len, tp, txTail, next);
	dinfo->txTail = next;

	tx_desc_t* descriptor = &dinfo->maptxdescs[txTail];

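	// The NIC updates the descriptor via DMA, so poll the status through a volatile pointer
	volatile uint8_t* txstatus = &descriptor->status;
	if (descriptor->length)
		while (!(*txstatus & 0xF)) // Catches "descriptor done" (DD) and various errors
			arch_pause();
	//Send: a pbuf may be a chain, so copy all tot_len bytes rather than only the first segment
	pbuf_copy_partial(p, raw_offset<void*>(dinfo->maptxbuf, dinfo->TX_BUFFER_SIZE * txTail), p->tot_len, 0);
	descriptor->length = p->tot_len;
	descriptor->command = 0x8 | 0x2 | 0x1;	// RS (report status) | IFCS (insert FCS) | EOP (end of packet)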
	descriptor->cso = 0;
	descriptor->css = 0;
	descriptor->status = 0;
	descriptor->special = 0;

	devregs.write(E1000_REG_TDT, next);

	return ERR_OK;
}

err_t i219_init(struct netif *netif)
{
	u8_t i;

	i219_driver_info* dinfo = (i219_driver_info*)netif->state;

	I219Registers devregs(dinfo->mapped_controller);

	unsigned char macAddress[6];
	uint32_t macLow = devregs.read(E1000_REG_RAL);
	if (macLow != 0x00000000)
	{
		// MAC can be read from RAL[0]/RAH[0] MMIO directly
		macAddress[0] = macLow & 0xFF;
		macAddress[1] = (macLow >> 8) & 0xFF;
		macAddress[2] = (macLow >> 16) & 0xFF;
		macAddress[3] = (macLow >> 24) & 0xFF;
		uint32_t macHigh = devregs.read(E1000_REG_RAH);
		macAddress[4] = macHigh & 0xFF;
		macAddress[5] = (macHigh >> 8) & 0xFF;
	}
	else
	{
		kprintf(u"Could not read MAC from MMIO\n");
		return ERR_ARG;
	}
	kprintf_a("MAC Address: %x:%x:%x:%x:%x:%x\n", macAddress[0], macAddress[1], macAddress[2], macAddress[3], macAddress[4], macAddress[5]);

	for (i = 0; i < ETH_HWADDR_LEN; i++) {
		netif->hwaddr[i] = macAddress[i];
	}
	netif->hwaddr_len = 6;
	netif->name[0] = 'e';
	netif->name[1] = 'n';
	netif->mtu = 1500;	// IP-level MTU; 1522 is the maximum VLAN-tagged frame size, not the MTU
	netif->num = 0;

	//Disable interrupts
	devregs.write(E1000_REG_IMC, UINT32_MAX);
	//Clear pending
	devregs.read(E1000_REG_ICR);

	// Clear multicast table array
	for (int i = 0; i < 128; ++i)
		devregs.write((e1000_register_t)(E1000_REG_MTA + 4 * i), 0x00000000);

	const size_t TX_BUFFER_SIZE = 2048;
	const size_t TX_DESC_COUNT = 8;

	paddr_t txbuf = 0;
	void* maptxbuf = nullptr;
	if (!i219_allocate_hardware_buffer(&maptxbuf, txbuf, TX_BUFFER_SIZE, TX_DESC_COUNT))
	{
		kprintf(u"Error: could not create Intel Gigabit Controller Tx Buffer: size %x\n", TX_BUFFER_SIZE * TX_DESC_COUNT);
		return ERR_MEM;
	}

	uint64_t txbufd;
	tx_desc_t* maptxdescs;
	if (!i219_allocate_hardware_buffer((void**)&maptxdescs, txbufd, sizeof(tx_desc_t), TX_DESC_COUNT))
	{
		kprintf(u"Error: could not create Intel Gigabit Controller Tx descriptor ring: size %x\n", TX_DESC_COUNT * sizeof(tx_desc_t));
		return ERR_MEM;
	}

	dinfo->maptxdescs = maptxdescs;
	dinfo->maptxbuf = maptxbuf;
	dinfo->TX_BUFFER_SIZE = TX_BUFFER_SIZE;
	dinfo->TX_DESC_COUNT = TX_DESC_COUNT;
	dinfo->txTail = 0;

	for (int i = 0; i < TX_DESC_COUNT; ++i)
	{
		// Initialize descriptor
		tx_desc_t *currDesc = &maptxdescs[i];
		currDesc->address = raw_offset<uint64_t>(txbuf, i * TX_BUFFER_SIZE);
		currDesc->length = 0;
		currDesc->status = 0;
		currDesc->cso = 0;
		currDesc->css = 0;
		currDesc->special = 0;
	}

	devregs.write(E1000_REG_TDBAH, txbufd >> 32);
	devregs.write(E1000_REG_TDBAL, txbufd & UINT32_MAX);
	devregs.write(E1000_REG_TDLEN, TX_DESC_COUNT * sizeof(tx_desc_t));
	devregs.write(E1000_REG_TDH, 0);
	devregs.write(E1000_REG_TDT, 0);

	//Receive descriptors
	const size_t RX_BUFFER_SIZE = 2048;
	const size_t RX_DESC_COUNT = 8;

	paddr_t rxbuf;
	void* maprxbuf;
	if (!i219_allocate_hardware_buffer((void**)&maprxbuf, rxbuf, RX_BUFFER_SIZE, RX_DESC_COUNT))
	{
		kprintf(u"Error: could not create Intel Gigabit Controller Rx Buffer: size %x\n", RX_DESC_COUNT * RX_BUFFER_SIZE);
		return ERR_MEM;
	}

	uint64_t rxbufd;
	rx_desc_t* maprxdescs;
	if (!i219_allocate_hardware_buffer((void**)&maprxdescs, rxbufd, sizeof(rx_desc_t), RX_DESC_COUNT))
	{
		kprintf(u"Error: could not create Intel Gigabit Controller Rx descriptor ring: size %x\n", RX_DESC_COUNT * sizeof(rx_desc_t));
		return ERR_MEM;
	}

	for (int i = 0; i < RX_DESC_COUNT; ++i)
	{
		// Initialize descriptor
		rx_desc_t *currDesc = &maprxdescs[i];
		currDesc->addr = raw_offset<uint64_t>(rxbuf, i * RX_BUFFER_SIZE);
		currDesc->length = 0;
		currDesc->status = 0;
		currDesc->checksum = 0;
		currDesc->errors = 0;
		currDesc->special = 0;
	}

	devregs.write(E1000_REG_RDBAH, rxbufd >> 32);
	devregs.write(E1000_REG_RDBAL, rxbufd & UINT32_MAX);
	devregs.write(E1000_REG_RDLEN, RX_DESC_COUNT * sizeof(rx_desc_t));
	devregs.write(E1000_REG_RDH, 0);
	devregs.write(E1000_REG_RDT, RX_DESC_COUNT - 1);	//Index of descriptor beyond last valid
	devregs.write(E1000_REG_RDTR, 0);

	dinfo->maprxdescs = maprxdescs;
	dinfo->maprxbuf = maprxbuf;
	dinfo->rx_buf_phy = rxbuf;
	dinfo->RX_BUFFER_SIZE = RX_BUFFER_SIZE;
	dinfo->RX_DESC_COUNT = RX_DESC_COUNT;
	dinfo->rxCur = 0;

	PciAllocateMsi(dinfo->address.segment, dinfo->address.bus, dinfo->address.device, dinfo->address.function, 1, &ethernet_interrupt, dinfo);

	//Allocate PBUFs for receiving

	// Enable transmitter
	uint32_t tctl = devregs.read(E1000_REG_TCTL);
	tctl |= E1000_TCTL_EN; // EN (Transmitter Enable)
	tctl |= E1000_TCTL_PSP; // PSP (Pad Short Packets)
	tctl |= E1000_TCTL_RTLC; // RTLC (Re-transmit on Late Collision)
	devregs.write(E1000_REG_TCTL, tctl);

	//Disable prefetch
	devregs.write(E1000_REG_RXDCTL, 0);
	// Enable receiver
	uint32_t rctl = devregs.read(E1000_REG_RCTL);
	rctl |= E1000_RCTL_EN; // EN (Receiver Enable)
	rctl &= ~E1000_RCTL_SBP; // SBP (Store Bad Packets)
	rctl |= E1000_RCTL_BAM; // BAM (Broadcast Accept Mode)
	rctl &= ~E1000_RCTL_SZ_4096;
	rctl |= E1000_RCTL_SZ_2048; // BSIZE = 2048 (Receive Buffer Size)
	rctl &= ~E1000_RCTL_BSEX;
	rctl |= E1000_RCTL_SECRC; // SECRC (Strip Ethernet CRC)
	devregs.write(E1000_REG_RCTL, rctl);

	//Enable interrupts
	devregs.write(E1000_REG_IAM, 0);
	devregs.write(E1000_REG_IMS, E1000_IMS_RXT0 | E1000_IMS_LSC);
	//devregs.write(E1000_REG_IMS, UINT32_MAX);

	netif->flags |= NETIF_FLAG_ETHARP | NETIF_FLAG_ETHERNET | NETIF_FLAG_BROADCAST;
	netif->linkoutput = &i219_tx;
	netif->output = &etharp_output;
	netif->output_ip6 = &ethip6_output;

	uint32_t devstat = devregs.read(E1000_REG_STATUS);
	uint8_t up = devstat & (1 << 1);
	//kprintf(u"Link active %d\n", up);
	if (up)
	{
		netifapi_netif_set_link_up_async(netif);
	}
	return ERR_OK;
}

static ip4_addr_t testing_ip;
static ip4_addr_t testing_netmask;
static ip4_addr_t testing_gateway;

bool Intel219Finder(uint16_t segment, uint16_t bus, uint16_t device, uint8_t function)
{
	size_t barsize = 0;
	paddr_t devbase = read_pci_bar(segment, bus, device, function, 0, &barsize);
	void* mapped_controller = find_free_paging(barsize);
	if (!paging_map(mapped_controller, devbase, barsize, PAGE_ATTRIBUTE_NO_CACHING | PAGE_ATTRIBUTE_WRITABLE))
	{
		kprintf(u"Error: could not map Intel Gigabit Controller: size %x\n", barsize);
		return false;
	}
	uint64_t commstat;
	read_pci_config(segment, bus, device, function, 1, 32, &commstat);
	commstat |= (1 << 10);	//Mask pinned interrupts
	commstat |= 0x6;		//Memory space and bus mastering
	write_pci_config(segment, bus, device, function, 1, 32, commstat);

	kprintf(u"Found an Intel Gigabit Ethernet card! %d:%d:%d:%d\n", segment, bus, device, function);

	i219_driver_info* state = new i219_driver_info;
	state->mapped_controller = mapped_controller;
	state->address.bus = bus;
	state->address.device = device;
	state->address.segment = segment;
	state->address.function = function;

	netif* netdev = new netif;
	memset(netdev, 0, sizeof(netif));
	state->netif = netdev;
	//netdev->hwaddr
	char16_t iptest = u'ח';		//Chet
	IP4_ADDR(&testing_ip, 172, 16, iptest >> 8, iptest & 0xFF);
	IP4_ADDR(&testing_netmask, 255, 255, 0, 0);
	IP4_ADDR(&testing_gateway, 172, 16, 0, 1);
#if USE_DHCP
	netifapi_netif_add(netdev, NULL, NULL, NULL, state, &i219_init, &tcpip_input);
#else
	netifapi_netif_add(netdev, &testing_ip, &testing_netmask, &testing_gateway, state, &i219_init, &tcpip_input);
#endif
	netifapi_netif_set_up_async(netdev);
#if USE_DHCP
	netifapi_dhcp_start(netdev);
#endif
	return false;
}

static pci_device_registration dev_reg_pci = {
	i219_devs,
	&Intel219Finder
};

EXTERN int CppDriverEntry(void* param)
{
	//Find relevant devices
	register_pci_driver(&dev_reg_pci);
	return 0;
}
But modern cards do offer a lot of features. Think Wake-on-LAN, which does offer address filtering. You can also offload checksum stuff to the cards, I believe.

Re: Starting with a network stack

Posted: Sat Oct 24, 2020 1:32 pm
by nullplan
sunnysideup wrote:If so, I can only think of 2 services that a NIC must provide - Putting out frames into the network & Receiving frames from the network. If my understanding is correct, why are NIC drivers so complex for modern NICs? What other services do they provide? What do you require your NIC to do apart from putting any frame to the interface and receiving any frame from the interface?
Well yes, those are the principal two functions of a NIC. However, contrary to rdos's assertion, to my (limited) knowledge, MAC address filtering is still very much a thing. Remember that switches are Ethernet bridges. Bridges send packets to the port they know the destination to be on, but if they see an address they don't know, they send the packet to every port. They cannot work any other way. In this way, even on a switched network, you will get packets not meant for you. As for multicast, there are multicast MAC addresses, and modern NICs support adding those to the list of allowed MAC addresses. Allowing all incoming packets is called "promiscuous mode" and is a feature you have to enable specifically in the NIC.
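
As a rough sketch of the receive filtering described above, the accept/reject decision can be modelled in software like this. The `RxFilter` struct, `add_multicast` and `accepts` are hypothetical names for illustration only; a real NIC keeps this state in hardware registers and often uses a hash table rather than an exact-match multicast list.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using MacAddress = std::array<std::uint8_t, 6>;

// Hypothetical software model of a NIC receive filter (not any real card's layout).
struct RxFilter {
    MacAddress own_address{};
    std::vector<MacAddress> multicast_allowed; // exact-match list, for simplicity
    bool accept_broadcast = true;
    bool promiscuous = false;
};

constexpr MacAddress kBroadcast{0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};

// Group (multicast/broadcast) addresses have the lowest bit of the first byte set.
inline bool is_group_address(const MacAddress& mac) {
    return (mac[0] & 0x01) != 0;
}

// Add a multicast address to the allowed list.
inline void add_multicast(RxFilter& filter, const MacAddress& mac) {
    assert(is_group_address(mac)); // only group addresses belong in this list
    filter.multicast_allowed.push_back(mac);
}

// Decide whether a frame with this destination address is passed to the host.
inline bool accepts(const RxFilter& filter, const MacAddress& destination) {
    if (filter.promiscuous)
        return true; // promiscuous mode: accept everything
    if (destination == kBroadcast)
        return filter.accept_broadcast;
    if (is_group_address(destination)) {
        return std::find(filter.multicast_allowed.begin(),
                         filter.multicast_allowed.end(),
                         destination) != filter.multicast_allowed.end();
    }
    return destination == filter.own_address; // unicast: must match our MAC
}
```

With this model, a frame flooded by a switch to a port whose MAC does not match is dropped by `accepts` before the host ever sees it, unless `promiscuous` is set.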

Anyway, as to what other functions there are in a NIC: Many of them support checksum offloading. That means that the OS does not compute TCP/UDP checksum, IP checksum, or Ethernet CRC, and just lets the NIC do those things. This is highly card dependent, and has to be set up. Furthermore, most 10Gb cards support Scatter/Gather I/O, in order to support zero-copy TCP. 10Gb is so fast that you cannot afford to copy the TCP payload into a new buffer, just to prepend the TCP header, then copy all of that into a new buffer, just to prepend the IP header, then copy all of that into a new buffer, just to add Ethernet header and footer. Even a better in-kernel system, allowing you to copy the TCP payload only once, would not be fast enough to actually use all of the available bandwidth. And that is with all of the payload already in RAM. So instead they support the card DMAing all the different parts from different places into the send buffer, and DMAing received packets the other way into different buffers from the receive buffer. (And I don't really know how that one works securely, or even safely. Likely it checks that certain key bytes are correct.)
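
To make the checksum part concrete, here is a minimal sketch of the 16-bit one's-complement checksum used by IP, TCP and UDP (the algorithm described in RFC 1071), which is the work a card with checksum offload takes over from the OS. The function name `internet_checksum` is an assumption for illustration; the TCP/UDP pseudo-header handling is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// One's-complement sum of 16-bit big-endian words, then complemented.
// A 32-bit accumulator is enough for buffers up to 64 KiB, which covers any IP packet.
inline std::uint16_t internet_checksum(const std::uint8_t* data, std::size_t length) {
    assert(data != nullptr || length == 0);
    std::uint32_t sum = 0;
    std::size_t i = 0;
    for (; i + 1 < length; i += 2)
        sum += (static_cast<std::uint32_t>(data[i]) << 8) | data[i + 1];
    if (i < length)
        sum += static_cast<std::uint32_t>(data[i]) << 8; // odd trailing byte, zero-padded
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);              // fold carries back in
    return static_cast<std::uint16_t>(~sum & 0xFFFF);
}
```

For the sample bytes `00 01 f2 03 f4 f5 f6 f7` from RFC 1071 this returns `0x220D`, and running it over data that already contains a correct checksum yields 0, which is how a receiver verifies a header.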

As for the remaining difficulties with NICs, most of them are about overly complex interrupt systems that try to prevent the one event users hate more than anything: Packets dropping because of resource exhaustion.

Re: Starting with a network stack

Posted: Sat Oct 24, 2020 1:39 pm
by eekee
sunnysideup wrote:...The only functionality that the NIC provides is something like a 'put_frame()' which just 'puts' the frame into the connected network.

However, when receiving packets, the hardware has inbuilt functionality to reject packets which are not addressed to us. That is, there doesn't seem to be a clean abstraction layer provided by the hardware - put_packet() doesn't take the receiver's address as an input parameter, while the receive_packet() does care about the layer 2 address.

This doesn't feel right to me... Perhaps my understanding is incorrect. Anyone willing to shed some light on this?
Symmetry isn't useful in this situation. The hardware to reject packets not addressed to this machine is useful: it saves the computer time servicing the interrupt and parsing the packet headers. Hardware to add the address for sending would be a complete waste of time: you'd just be writing the packet address to a different memory/IO address than the rest of the packet.

Something I learned a long time ago: Always beware "it doesn't feel right". The feeling is often not helpful and can be outright harmful, such as by making simple, straightforward things hard to understand simply because they're not symmetrical or don't meet some other artificial standard of beauty.

Re: Starting with a network stack

Posted: Sat Oct 24, 2020 3:16 pm
by rdos
sunnysideup wrote:
rdos wrote:I don't know what layer 2 is, but the filtering happens on MAC addresses, not IP addresses. Although I think this is legacy from when Ethernet actually was implemented with a coax cable and had all computers connected to a single cable. Today we have switches that do the filtering, and so I don't think the NIC actually needs to filter. There are also multicast addresses and broadcast addresses.
So if I understand correctly, modern NICs (WiFi, Ethernet, etc - The most common ones afaik) do not have MAC address filtering functionality built into the hardware (When I was reading about the RTL8139, I did find filtering functionality)? This would imply that the system software (us) would have to manually check every single frame that arrives in the NIC (Not all frames in a network would arrive to a particular NIC of course - switches will filter some) for the destination address.
I don't think I wrote that it didn't filter, rather that this is not such a big issue anymore. With the original Ethernet hardware you could set the Ethernet NIC into promiscuous mode and see everything transmitted on Ethernet, which certainly was a security issue, but doing this today won't get a lot more information since switches will primarily send packets addressed to your computer.

Even with filtering, you still need to inspect every packet and decide what to do with it. Some broadcasts will be irrelevant, some could be used to save MACs and some need to be replied to (those addressing your own IP).
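
A minimal sketch of that "inspect every packet and decide" step, assuming the stack dispatches on the EtherType field. The `FrameKind` enum and `classify_frame` are hypothetical names; the VLAN tag skipping mirrors the 0x8100 / 0x88A8 check in the driver's transmit path earlier in the thread, treating each tag as 4 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class FrameKind { TooShort, Ipv4, Arp, Ipv6, Other };

// Read a 16-bit big-endian value from a byte buffer.
inline std::uint16_t read_be16(const std::uint8_t* p) {
    return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Classify a received Ethernet frame by its EtherType, skipping VLAN tags.
inline FrameKind classify_frame(const std::uint8_t* frame, std::size_t length) {
    assert(frame != nullptr);
    std::size_t type_offset = 12; // after destination MAC (6) and source MAC (6)
    if (length < type_offset + 2)
        return FrameKind::TooShort;
    std::uint16_t type = read_be16(frame + type_offset);
    // Each VLAN tag (802.1Q 0x8100, 802.1ad 0x88A8) is 4 bytes: TPID + TCI.
    while (type == 0x8100 || type == 0x88A8) {
        type_offset += 4;
        if (length < type_offset + 2)
            return FrameKind::TooShort;
        type = read_be16(frame + type_offset);
    }
    switch (type) {
        case 0x0800: return FrameKind::Ipv4;
        case 0x0806: return FrameKind::Arp;
        case 0x86DD: return FrameKind::Ipv6;
        default:     return FrameKind::Other;
    }
}
```

A broadcast ARP request would come back as `FrameKind::Arp` and be handed to the ARP code, which then decides whether the target IP is its own.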
sunnysideup wrote: * If so, I can only think of 2 services that a NIC must provide - Putting out frames into the network & Receiving frames from the network. If my understanding is correct, why are NIC drivers so complex for modern NICs? What other services do they provide? What do you require your NIC to do apart from putting any frame to the interface and receiving any frame from the interface?
Not much, but this is quite complicated since most will use RAM buffering and bus mastering. Some might also be able to offload checksum calculations and handle scatter-gather, but this will complicate your network stack.
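
The RAM buffering mentioned here usually takes the form of a descriptor ring shared between driver and card. The following is a simplified software model, not any real card's layout: `DescriptorRing`, `enqueue` and `device_process_one` are hypothetical names, and the "device" side is simulated by a method call where real hardware would advance the head by DMA.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified model of a transmit descriptor ring with N slots.
// The driver advances the tail; the device advances the head.
template <std::size_t N>
class DescriptorRing {
    static_assert(N >= 2, "ring needs at least two slots");

public:
    // Driver side: queue one buffer. One slot stays empty so that
    // head == tail always means "ring empty".
    bool enqueue(std::uint64_t buffer_address, std::uint16_t length) {
        assert(length > 0);
        std::size_t next = (tail_ + 1) % N;
        if (next == head_)
            return false; // ring full: caller must wait or drop
        slots_[tail_] = {buffer_address, length, false};
        tail_ = next;     // a real driver would now write the tail register
        return true;
    }

    // Device side (simulated): consume one descriptor and mark it done.
    bool device_process_one() {
        if (head_ == tail_)
            return false; // nothing queued
        slots_[head_].done = true;
        head_ = (head_ + 1) % N;
        return true;
    }

    // Number of descriptors queued but not yet processed.
    std::size_t pending() const { return (tail_ + N - head_) % N; }

private:
    struct Descriptor {
        std::uint64_t address = 0;
        std::uint16_t length = 0;
        bool done = false;
    };
    std::array<Descriptor, N> slots_{};
    std::size_t head_ = 0;
    std::size_t tail_ = 0;
};
```

Most of the driver complexity comes from keeping this shared state consistent while interrupts, DMA and the network stack all touch it at different times.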

Re: Starting with a network stack

Posted: Sat Oct 24, 2020 3:25 pm
by rdos
nullplan wrote:Well yes, those are the principal two functions of a NIC. However, contrary to rdos's assertion, to my (limited) knowledge, MAC address filtering is still very much a thing. Remember that switches are Ethernet bridges. Bridges send packets to the port they know the destination to be on, but if they see an address they don't know, they send the packet to every port.
Unless the switch is rebooted (which is unusual), sending to unknown destinations will always involve sending ARP requests, and everybody on the network needs to handle those and they are always broadcast. So, filtering does no good in that case, since these are broadcasts and ARPs need to be processed so you know how to reach that particular destination in case you send to it in the future.
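
For reference, the switch behaviour under discussion (forward to a known port, flood otherwise) can be sketched as a small learning table. `LearningSwitch` and `handle_frame` are hypothetical names; real switches also age entries out and limit table size, which this sketch ignores.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>

using MacAddress = std::array<std::uint8_t, 6>;

// Minimal model of a learning Ethernet bridge (no ageing, no VLANs).
class LearningSwitch {
public:
    // Returns the single port to forward to, or std::nullopt to mean
    // "flood to every port except the ingress port".
    std::optional<int> handle_frame(int ingress_port,
                                    const MacAddress& source,
                                    const MacAddress& destination) {
        assert(ingress_port >= 0);
        table_[source] = ingress_port;  // learn where the sender lives
        if (destination[0] & 0x01)
            return std::nullopt;        // broadcast/multicast: always flood
        auto it = table_.find(destination);
        if (it == table_.end())
            return std::nullopt;        // unknown unicast: flood
        return it->second;              // known unicast: one port only
    }

private:
    std::map<MacAddress, int> table_;
};
```

In this model an ARP request is flooded because its destination is the broadcast address, and the unicast reply then teaches the switch the responder's port.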
nullplan wrote: Anyway, as to what other functions there are in a NIC: Many of them support checksum offloading. That means that the OS does not compute TCP/UDP checksum, IP checksum, or Ethernet CRC, and just lets the NIC do those things.
I think every NIC will handle the Ethernet CRC (at least the ones I've studied do). Offloading the TCP/IP checksums is a lot more complicated, and so is providing a zero-copy stack. There is also a conflict between getting high performance with older cards without scatter-gather and checksum offloading and newer cards that support those.

Maybe the hardest challenge, which I currently cannot handle, is ARP flooding that leads to NIC overload and buffer overflows, which will typically turn off the NIC and require some kind of restart on an overloaded network.