pfSense Multi-WAN - How to really make it work
Please see updates for notes I've added since the original article.
For those who haven't come across it yet, pfSense is THE BEST Network Gateway product there is. When you couple that with it being an Open Source project, its real value is astounding. I will not attempt to list its features here. If you want to find out more, and it's well worth it, try this link.
Everything that follows is based on lab tests and a real pfSense deployment using the 2.0.1 release. So this configuration actually works (given the issues documented within). I have read so many posts and articles about this subject that showed configurations that plainly would not work, or were based on out of date software, that I felt moved to document a working configuration.
I am involved in a project that is using pfSense as the Network Gateway between a distributed WiFi Network and multiple WAN links that allow the WiFi users to have Internet access. In this environment a number of features of pfSense are used to provide the Network Services for the WiFi users.
DHCP Server - Used to automatically assign IP addresses and configuration to connecting users/devices.
DNS Forwarder - Used to allow DHCP configured devices to perform DNS lookups without having to know the DNS configuration used with the multiple connected WAN links.
NTP Server - Used to allow DHCP configured devices to get time and date simply from the pfSense Server.
Captive Portal - Used to provide the DHCP configured devices with a web based authentication and bandwidth control system. Backend Authentication and Accounting (for data volumes consumed) is handled by a separate RADIUS Server.
HTTP Proxy - Used to provide web access control (along with the Proxy filter) and caching.
Multi-WAN - Used to load balance outbound traffic and fail over using multiple WAN connections to the Internet. Not really a service but a configuration that allows for balancing and fail over.
Firewall - Used to provide further security, Network Address Translation and provide the rules for Multi-WAN.
Other pfSense features are used in the Network (below) but have been omitted here as I am limiting this article to the configuration and issues with Multi-WAN.
A Real Network Deployment.
General notes regarding Multi-WAN operation.
But first of all, a general note about Multi-WAN. The most important consideration when looking at the configuration required to enable Multi-WAN is to ask where the traffic comes from. This is important because, for traffic to be routed over multiple links, it must pass through an interface where the routing can be applied. There are really just two sources: internal to the pfSense Server and external to it.
External traffic is the simplest to consider as it obviously hits a physical LAN Interface on the way into the pfSense Server (in the above case from the distributed WiFi Network). So routing can be applied at that point.
Internal traffic is more difficult, as most Network Services running within a Server will attach (technically bind) themselves to any (and all) IP addresses/Interfaces that exist on the Server. This means that outbound traffic does not pass through any Interface before it hits the final physical WAN Interface, because it is already there. So any routing applied at this point is too late. The way round this is to get the Network Service to use the loopback Interface (127.0.0.1), so the routing is applied at this Interface before the traffic hits the WAN Interface, and outbound load balancing and fail over can work!
Due to the need to bind to the local loopback Interface, the default Gateway Network Address Settings are used so that pfSense generates the NAT rules.
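To make "binding" concrete, here is a minimal Python sketch of a client that fixes its source address before connecting. It is only an illustration of the idea, not how any pfSense service is implemented; the function name `open_from` is my own, and a throwaway local listener stands in for a real remote server.

```python
import socket


def open_from(source_ip: str, dest: tuple[str, int]) -> socket.socket:
    """Open a TCP connection that leaves from a chosen local address."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Without this bind() the OS picks the source address itself, which is
    # the "any and all" behaviour described above.
    sock.bind((source_ip, 0))  # port 0 = let the OS pick a free port
    sock.connect(dest)
    return sock


if __name__ == "__main__":
    # A throwaway local listener stands in for the remote server.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    conn = open_from("127.0.0.1", listener.getsockname())
    print("source address:", conn.getsockname()[0])
    conn.close()
    listener.close()
```

On an ordinary host, a socket bound to 127.0.0.1 cannot reach an Internet address by itself; in the setup described here it is the pfSense routing and NAT rules that carry the loopback-sourced traffic out over a WAN link.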
Steps taken to implement Multi-WAN.
I will take each of the Network Services in turn and discuss the Multi-WAN issues associated with each.
DHCP Server
No issues here; this was just included as it is used to supply the IP configuration that clients on the WiFi Network will use. This also explains why the NTP Server and DNS Forwarder on the pfSense Server are utilised by these clients. The IP address of the pfSense Server on the WiFi Network Interface is supplied as both the DNS and NTP Server addresses.
DNS Forwarder (dnsmasq)
We have to provide a DNS service for clients to use to look up Domain names and get IP addresses. The real DNS Servers used on the WAN side of the pfSense Server will be dictated by the connections allowed and services provided by the ISPs providing the WAN connections. The DNS Forwarder does the real DNS lookups and can cache the results. It can also be configured to use particular DNS Servers via a particular connected WAN Gateway. By default the DNS Forwarder binds to any and all Interfaces, which prevents load balancing from working (as discussed before). However, the DNS settings for pfSense allow a way round this which does not provide load balancing but does provide fail over support.
In the environment I have used, there are two DNS servers where both can be accessed over either link.
In "System > General Setup", I have configured one DNS Server via the Gateway on the first WAN connection and the other on the second connection. For good measure I have also configured the Google DNS Servers split over the two WAN links as below (IPs changed to protect the innocent).
This causes pfSense to create specific routes to the DNS Servers via the specified Gateways. The effect is that even if your default Gateway goes down the DNS Forwarder can still reach a DNS Server on the Internet.
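The fail over effect can be modelled in a few lines of Python. This is only a sketch of the reasoning, not pfSense code: the gateway names `WAN1GW` and `WAN2GW` and the two documentation-range addresses are made up for illustration, and only the Google DNS addresses are real.

```python
# Each DNS server is pinned to one gateway, as configured in
# "System > General Setup". Names and the first two addresses are hypothetical.
DNS_VIA_GATEWAY = {
    "192.0.2.53": "WAN1GW",
    "198.51.100.53": "WAN2GW",
    "8.8.8.8": "WAN1GW",
    "8.8.4.4": "WAN2GW",
}


def reachable_dns(gateways_up: set[str]) -> list[str]:
    """Return the DNS servers whose pinned gateway is currently up."""
    return [dns for dns, gw in DNS_VIA_GATEWAY.items() if gw in gateways_up]


if __name__ == "__main__":
    # With the first WAN link down, the servers pinned to the second remain.
    print(reachable_dns({"WAN2GW"}))
```

Because the servers are split across both links, losing either gateway still leaves the forwarder with servers it can reach; nothing here balances the load, it only avoids a single point of failure.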
Firewall note, I had to add a rule to the Firewall on the LAN Interface for the WiFi Network to allow any IP on that LAN to use DNS (UDP port 53) on the pfSense Server on its IP for that LAN Interface.
NTP Server
We provide an NTP Server to allow clients to synchronise their clocks. It appears that the NTP Server in pfSense also binds to any and all Interfaces, so it will not load balance or support fail over. I intend to look into this when time allows (no joke intended). For the moment, this is not considered critical as we can manually switch default gateways at some time after a WAN link failure.
Firewall note, I had to add a rule to the Firewall on the LAN Interface for the WiFi Network to allow any IP on that LAN to use NTP (UDP port 123) on the pfSense Server on its IP for that LAN Interface.
Captive Portal
We use the Captive Portal in pfSense to stop all Internet access until a user on the WiFi Network has logged in. Until the user has logged in, Captive Portal will intercept any attempt to use a web site and redirect the request to a page hosted on the pfSense Server (TCP port 8000). In our environment, this page redirects to an external Web Server that gathers the Login details and passes them back to the Captive Portal, which then authenticates with an external RADIUS Server. Once the user has successfully authenticated, Captive Portal regulates the bandwidth allocated to that user and records their data consumption back to the external RADIUS Servers.
This means that two types of traffic are involved: Web access for the authentication dialogue, and RADIUS for verifying the authentication (and passing back bandwidth limiting data) and for recording data usage values. The Web access is simple as it hits the WiFi Network's LAN Interface and so can be routed as discussed before. The difficulty lies with the interaction with RADIUS, as this originates locally on the pfSense Server. Technically, this is implemented in the RADIUS PECL PHP extension, which seems to have the usual issue of binding to any and all Interfaces, thus making all RADIUS communications flow only via the default Gateway. This is much more of an issue than with the NTP Server as it will prevent any logins if the default Gateway is down. Also, no accounting data will be recorded under these circumstances.
Note to self - I need to investigate this further!
HTTP Proxy (squid)
We run squid in transparent mode so that users do not need to configure their browsers to use the proxy. This covers HTTP traffic, but as HTTPS cannot be intercepted by the transparent proxy (it would be a protocol security exposure if it could), there are two paths through pfSense for Web traffic: HTTP and HTTPS.
HTTPS - Is actually the simple case as the traffic hits the LAN Interface on the WiFi Network and is routable.
HTTP - Is trapped by squid, and those requests that actually need to be honoured by a real request to the Internet come from squid itself running locally on the pfSense Server. Luckily squid can be configured to bind to the loopback adapter.
In "Services > Proxy server" insert "tcp_outgoing_address 127.0.0.1;" at the start of the "Custom options".
Warning: Some guides and posts say that squid must be configured with the "loopback" Interface selected as well as the LAN Interface in the "General > Proxy Interface" settings. This is not needed and causes problems.
Firewall note, I found that adding a rule to the Firewall on the LAN Interface for the WiFi Network to allow any IP on that LAN to use TCP port 3128 on any IP (has to be any IP!) was a good idea to cut down firewall logging. Incidentally, I found there was no need to add a rule to allow HTTP (TCP port 80), as squid hooks this in itself.
Now, before we can route traffic out in a load balanced and fail over aware way, we need to define the path out to the Internet. There are many ways to set this up, but to simplify the configuration I'm going to use the environment that I'm working on as an example. In this case we have two WAN links which go via two local Routers. Each Router has a local Private IP on the LAN that connects to the pfSense Server's WAN Interfaces and a fixed Public IP on the Internet side. We balance traffic equally over the two WAN links and allow fail over from either to either.
The two Routers are defined in "System > Routing > Gateways", which they would need to be anyway even without any balancing or fail over. I have set the "Monitor IP" to the Public address of the Routers (once again changed in the image to protect the innocent) so that if their Internet connection is dropped this will be detected.
Now the Routers can be grouped in "System > Routing > Groups". Both Routers are added to a group called "WANGroup", and both as "Tier 1". This gives them both the same priority for fail over and balancing.
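As I understand gateway groups, traffic is shared between the live members of the lowest-numbered tier, and a higher-numbered tier is only used when every member of the tiers before it is down. The Python sketch below models just that selection logic; the gateway names are hypothetical and the real balancing is done by the packet filter, not by code like this.

```python
def active_members(group: dict[str, int], up: set[str]) -> list[str]:
    """Return the live gateways in the lowest-numbered tier that has any."""
    live = {gw: tier for gw, tier in group.items() if gw in up}
    if not live:
        return []
    best_tier = min(live.values())
    return sorted(gw for gw, tier in live.items() if tier == best_tier)


# Both Routers in Tier 1, as in "WANGroup" (gateway names are assumptions).
WANGROUP = {"WAN1GW": 1, "WAN2GW": 1}

if __name__ == "__main__":
    print(active_members(WANGROUP, {"WAN1GW", "WAN2GW"}))  # both share the load
    print(active_members(WANGROUP, {"WAN2GW"}))            # fail over to one link
```

Putting the two Routers in different tiers instead would give pure fail over with no balancing, since only one tier is ever active at a time.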
The last part of the puzzle!
The Firewall rules control where the Network traffic flows, not just in terms of what is allowed to pass through but over what path.
Pulling together the previous discussions, the Firewall rules for the WiFi LAN Interface are as below. There are a couple of rules that have not been discussed so far. The first rules are just to allow PINGs for diagnostics and client testing. Also the rule that allows HTTPS to the pfSense Server is there to allow support staff to access the pfSense Server from the field. The penultimate rule blocks any other access to the pfSense Server.
The order of the rules is very important. All the rules before the final rule not only allow access to various services on the pfSense Server but also prevent that traffic from being affected by outbound load balancing and fail over. The final rule allows all outbound traffic to be balanced and fail over eligible by specifying the Gateway to be used as "WANGroup". This covers HTTPS (that does not go out through the squid Proxy) and other protocols such as e-mail clients etc.
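Why the order matters can be shown with a small first-match model in Python. The rule list is my approximate reconstruction from the prose in this article rather than a copy of the real rule set, and the `Rule` class and the "self"/"internet" destination labels are simplifications invented for the sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    proto: str            # "tcp", "udp", "icmp" or "any"
    dst: str              # "self" (pfSense's LAN address) or "any"
    port: Optional[int]   # None means any port
    action: str           # "pass" or "block"
    gateway: str = "default"


# Approximation of the WiFi LAN rules described in the text, top to bottom.
LAN_RULES = [
    Rule("icmp", "any", None, "pass"),               # PINGs for diagnostics
    Rule("udp", "self", 53, "pass"),                 # DNS Forwarder
    Rule("udp", "self", 123, "pass"),                # NTP Server
    Rule("tcp", "self", 443, "pass"),                # support staff access
    Rule("tcp", "any", 3128, "pass"),                # squid
    Rule("any", "self", None, "block"),              # nothing else to pfSense
    Rule("any", "any", None, "pass", "WANGroup"),    # balanced catch-all
]


def first_match(rules: list[Rule], proto: str, dst: str,
                port: Optional[int]) -> Optional[Rule]:
    """Return the first rule that matches, as interface rules are evaluated."""
    for rule in rules:
        if (rule.proto in ("any", proto) and rule.dst in ("any", dst)
                and rule.port in (None, port)):
            return rule
    return None  # no match: treated as blocked


if __name__ == "__main__":
    print(first_match(LAN_RULES, "udp", "self", 53).gateway)       # default
    print(first_match(LAN_RULES, "tcp", "internet", 443).gateway)  # WANGroup
    print(first_match(LAN_RULES, "tcp", "self", 22).action)        # block
```

If the catch-all were moved to the top, DNS and NTP requests aimed at the pfSense Server would be handed to "WANGroup" as well, which is exactly what the earlier rules are there to prevent.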
There is one final rule that is used to handle the traffic from local services on the pfSense server itself. From the previous discussions, this is ONLY requests from the squid Proxy that need to be passed out to the Internet. This is defined in "Firewall > Rules > Floating";
It's important to note that floating rules can be associated with any number of Interfaces. This rule is linked to both WAN interfaces, which stops some complications where pfSense has to create additional (negation) rules to prevent other interfaces (LAN rather than WAN) from being affected by this rule. Floating rules also have a "Direction" setting which is set to "out" so that only outbound traffic is affected. It's also important that this rule has the "Quick" option disabled (unticked) so that other rules (those above) are applied first. The source IP address is set to "*" (any) so that it will apply to any source IP address that uses the HTTP destination port and has not already been caught by one of the other rules, a catch-all for this traffic. Due to the restrictions discussed before, it's only the squid traffic heading out from the loopback Interface (127.0.0.1) that will hit this rule.
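The effect of the "Quick" option follows from how the underlying pf packet filter evaluates rules: a matching rule marked quick ends evaluation immediately, while without quick the last matching rule wins. The Python sketch below models only that behaviour; the rule names, the example address and the `PfRule` class are hypothetical and not taken from the real rule set.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Packet = dict  # e.g. {"dst": "192.0.2.1", "port": 80}


@dataclass
class PfRule:
    name: str
    matches: Callable[[Packet], bool]
    quick: bool = False


def evaluate(rules: list[PfRule], packet: Packet) -> Optional[PfRule]:
    """pf-style evaluation: last match wins unless a quick rule matches."""
    winner = None
    for rule in rules:
        if rule.matches(packet):
            winner = rule
            if rule.quick:
                break  # quick stops evaluation at this rule
    return winner


def build_rules(floating_quick: bool) -> list[PfRule]:
    """A catch-all HTTP rule followed by a more specific (hypothetical) rule."""
    return [
        PfRule("floating HTTP catch-all", lambda p: p["port"] == 80,
               quick=floating_quick),
        PfRule("specific host rule",
               lambda p: p["port"] == 80 and p["dst"] == "192.0.2.1"),
    ]


if __name__ == "__main__":
    pkt = {"dst": "192.0.2.1", "port": 80}
    print(evaluate(build_rules(False), pkt).name)  # the later rule overrides
    print(evaluate(build_rules(True), pkt).name)   # the catch-all takes it all
```

With "Quick" unticked the floating rule behaves as a true catch-all: it only decides the fate of traffic that no other matching rule claims.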
Well, it works for traffic directly from the WiFi Network; it's sort of working for DNS, but it is not working with NTP (not too big an issue) or RADIUS (bit of a problem, this one).
I also didn't mention SYSLOG, for passing system messages to some central logging system (I'm using Graylog2). This is another case where a local service won't load balance.
Overall, a bit more work is required for local services, but it's a brilliant product.
Thanks for reading
If you have any comments, please contact me.
Richard Gate, CommuniG8