Functional Iteration: Better than Pythonic

TIRED OF writing for loops yet? Let’s take some inspiration from functional programming to improve our Python programming, making use of the inbuilt functions.

To highlight the benefits of this approach, we’ll make use of a semi-realistic example based on a common use of Python, web-scraping. We’ll start with a list of URLs to scrape, and we want to get all the linked URLs.

Start With Pythonic

WE’LL MAKE use of two third-party libraries, BeautifulSoup and requests (with lxml as the parser), to start from pythonic, imperative code.

I’ve avoided using any functions here to make the process of refactoring this into a functional paradigm simpler to follow.

from bs4 import BeautifulSoup
from requests import get

to_scrape = ["http://www.example.com"]

# Make a list to store our urls
all_urls = list()

# Iterate through urls in to_scrape
for url in to_scrape:
    # Get the HTML
    r = get(url)
    # Check we did get the HTML
    if r.status_code == 200:
        # Make a soup to find the anchors
        soup = BeautifulSoup(r.content,
                             "lxml")
        page_as = soup.find_all('a')
        # Iterate through the anchors
        for anchor in page_as:
            href = anchor.get('href')
            # Store the URL
            all_urls.append(href)

for url in all_urls:
    print(url)
# Output:
# http://www.iana.org/domains/example

Step 1: Abstract the Function

THE FIRST step is to abstract the function from the iteration. In the inner loop over the anchors, our intention is to extract the href value, which is the URL, from each anchor.

In the end, we won’t need to worry about storing the href in all_urls, so we’ll skip that too.

def get_href(anchor):
    return anchor.get('href')

# The inner loop then becomes:
for anchor in page_as:
    href = get_href(anchor)

This change looks so trivial that you may wonder, “Why bother?” By changing the object method (anchor.get) to a plain function, we can now use map, a built-in that applies a function to every item of an iterable in one line and brings several benefits.

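Like a Python generator, map is lazily evaluated in Python 3. This means you’re not filling your memory with data you don’t need yet, or computing values you may never need. So now we don’t need to pre-declare our all_urls list and append to it, as we did in the first version. Be aware that this intermediate version rebinds all_urls on every pass through the outer loop, which is fine for our single starting URL but would keep only the last page’s links for a longer list; Step 2 removes that loop. So our updated code is: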

from bs4 import BeautifulSoup
from requests import get

to_scrape = ["http://www.example.com"]

def get_href(anchor):
    return anchor.get("href")

for url in to_scrape:
    r = get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.content,
                             "lxml")
        page_as = soup.find_all('a')
        all_urls = map(get_href, page_as)

for url in all_urls:
    print(url)
# Output:
# http://www.iana.org/domains/example
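
To make the lazy evaluation described above concrete, here is a minimal sketch that needs no network access. The helper record_and_double and the calls list are hypothetical names used only for illustration; they record when map actually does its work.

```python
calls = []

def record_and_double(x):
    # Hypothetical helper: log each call so
    # we can see when the work happens
    calls.append(x)
    return x * 2

doubled = map(record_and_double, [1, 2, 3])
calls_before = list(calls)  # no calls yet

first = next(doubled)       # one call only
calls_after_one = list(calls)

rest = list(doubled)        # the remainder
again = list(doubled)       # now exhausted
```

Nothing is computed until a value is requested, and once consumed the map object is empty. The same applies to all_urls: looping over it a second time would print nothing.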

Step 2: Handling if

REFACTORING THE outer loop, which gets the HTML and extracts the anchors, needs to be handled slightly differently because of the if statement in it. Our intention is to prevent BeautifulSoup from attempting to create a soup if we didn’t get the expected response, which could cause the program to crash or return invalid results.

We need to remove any results where the status_code is not 200. To do this, we will filter the results using filter, which is also lazily evaluated. The final function we need makes a BeautifulSoup and extracts the page anchors. Mapping it over the results gives us one list of anchors per page, and we need a single flat sequence, so we use chain.from_iterable from itertools to flatten it.

from bs4 import BeautifulSoup
from itertools import chain
from requests import get

to_scrape = ["http://www.example.com"]

def get_href(anchor):
    return anchor.get("href")

def status_200(result):
    return result.status_code == 200

def get_anchors(result):
    soup = BeautifulSoup(result.content,
                         "lxml")
    return soup.find_all('a')

results = map(get, to_scrape)
valid_results = filter(status_200, results)
page_as = map(get_anchors, valid_results)
flat_anchors = chain.from_iterable(page_as)
all_urls = map(get_href, flat_anchors)

for url in all_urls:
    print(url)
# Output:
# http://www.iana.org/domains/example
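
The filter-then-flatten step can be tried without touching the network. The sketch below is only an illustration under stated assumptions: fake_results and get_links are hypothetical stand-ins for real responses and for get_anchors, built with SimpleNamespace so that each object has a status_code attribute.

```python
from itertools import chain
from types import SimpleNamespace as NS

# Hypothetical stand-ins for responses
fake_results = [
    NS(status_code=200, links=["/a", "/b"]),
    NS(status_code=404, links=["/broken"]),
    NS(status_code=200, links=["/c"]),
]

def status_200(result):
    return result.status_code == 200

def get_links(result):
    # Stands in for get_anchors: returns
    # one list per page
    return result.links

valid = filter(status_200, fake_results)
per_page = map(get_links, valid)
# per_page yields one list per valid page;
# chain.from_iterable joins them lazily
flat = list(chain.from_iterable(per_page))
```

The 404 entry is dropped by filter before get_links ever sees it, and the two remaining lists come out as one flat sequence.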

Benefit 1: Function Reuse

WE’VE GOT three functions that we might find useful. We could decompose get_anchors into two more functions, but we’ll work with what we’ve got. It’s common to scrape a page for URLs, get the content at those URLs, and find all the URLs on those pages.

We’ve declared all the functions we need to get the URLs from an initial page. Because we abstracted them from the iteration, we can now use them elsewhere. The following will run if you add it to the existing code.

# Re-use the functions to get URLs
# from one page
result = get("http://www.example.com")
if status_200(result):
    anchors = get_anchors(result)
else:
    raise ValueError("No initial source")
to_scrape = map(get_href, anchors)

Benefit 2: Easy Parallelisation

ONCE WE’VE made a working application, we’ll probably want to parallelise parts of it, such as the get requests, to speed up our scraping. When working with imperative code, this is an intimidating task. By using functional style iteration, we’ve avoided mutating any data and we can add parallelisation with a single function.

To add parallelisation, we need a parallel version of map. Sadly there isn’t one among Python’s built-in functions, but the standard library’s multiprocessing.Pool has a map method, so we can wrap it in a small helper and reuse it in lots of projects. Compare this to what you’d have to do to parallelise a list/generator comprehension. Aren’t you glad Guido relented and let us keep map?

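from multiprocessing import Pool

def pmap(f, iterable):
    # pool.map waits for all results
    with Pool(4) as pool:
        yield from pool.map(f, iterable)

# On Windows/macOS, consume results under
# an if __name__ == "__main__": guard
# results = map(get, to_scrape)
results = pmap(get, to_scrape)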
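
Because HTTP requests spend most of their time waiting on the network, a thread pool is a common alternative to a process pool. The sketch below swaps in concurrent.futures.ThreadPoolExecutor from the standard library; thread_pmap, its workers parameter and slow_square are hypothetical names for illustration, not part of the code above.

```python
from concurrent.futures import (
    ThreadPoolExecutor,
)

def thread_pmap(f, iterable, workers=4):
    # Hypothetical helper: run f over the
    # items in a pool of threads. Results
    # come back in input order.
    with ThreadPoolExecutor(
            max_workers=workers) as pool:
        yield from pool.map(f, iterable)

def slow_square(x):
    # Stand-in for a slow call such as get
    return x * x

squares = list(
    thread_pmap(slow_square, range(5)))
```

Threads share memory with the main program, so nothing needs to be pickled and no __main__ guard is required, which makes this a drop-in replacement for map in the pipeline.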

Conclusion

A FUNCTIONAL approach to iteration, using map and filter, has helped to abstract our functions from our iteration. I’ve described two benefits that are easy to see and difficult to deny, but I also believe that this code is easier to read and can be more memory efficient. If I weren’t restricted to 40 character lines in this blog, I’d have chained the maps and filters together without variable names between. I find this easier to parse, but that might just be me.
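
As a rough idea of what that chained form looks like, here is a sketch on stand-in data rather than real pages. The pages list is hypothetical; the shape of the pipeline (filter, map, flatten, map) mirrors the scraper.

```python
from itertools import chain

# Hypothetical stand-in for fetched pages
pages = ["a b", "", "c d e"]

# Read from the inside out: drop empty
# pages, split each into words, flatten,
# then transform every word
words = map(
    str.upper,
    chain.from_iterable(
        map(str.split,
            filter(None, pages))))

result = list(words)
```

Passing None as the first argument makes filter keep only truthy items, which removes the empty page here.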

Finally, I’d be remiss if I didn’t mention reduce. It’s the third big functional-style tool for iteration, and you’ll find it in functools. It’s used for iterating whilst combining the elements in some way; the inbuilt sum function, for example, can be written with reduce.
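
As a minimal sketch of that last point, here is a sum written with reduce. The name my_sum is hypothetical, and the start argument is there so that an empty input returns 0 rather than raising an error.

```python
from functools import reduce
from operator import add

def my_sum(numbers, start=0):
    # Fold add over numbers, beginning
    # from start
    return reduce(add, numbers, start)

total = my_sum([1, 2, 3, 4])
empty_total = my_sum([])
```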

Until next time, happy coding. Paul