TIRED OF writing for loops yet? Let’s take some inspiration from functional
programming to improve our Python, making use of the inbuilt
map and filter functions.
To highlight the benefits of this approach, we’ll make use of a semi-realistic example based on a common use of Python, web-scraping. We’ll start with a list of URLs to scrape, and we want to get all the linked URLs.
Start With Pythonic
I’ve avoided using any functions here to make the process of refactoring this into a functional paradigm simpler to follow.
from bs4 import BeautifulSoup
from requests import get

to_scrape = ["http://www.example.com"]

# Make a list to store our urls
all_urls = list()

# Iterate through urls in to_scrape
for url in to_scrape:
    # Get the HTML
    r = get(url)
    # Check we did get the HTML
    if r.status_code == 200:
        # Make a soup to find the anchors
        soup = BeautifulSoup(r.content, "lxml")
        page_as = soup.find_all('a')

        # Iterate through the anchors
        for anchor in page_as:
            href = anchor.get('href')

            # Store the URL
            all_urls.append(href)

for url in all_urls:
    print(url)

# Output:
# http://www.iana.org/domains/example
Step 1: Abstract the Function
THE FIRST step is to abstract the function from the iteration. In the inner
loop over the anchors, our intention is to extract the href value, which is
the URL, from each anchor. In the end, we won’t need to worry about storing
the URLs in a list, so we’ll skip that too.
def get_href(anchor):
    return anchor.get('href')

for anchor in page_as:
    href = get_href(anchor)
This change looks so trivial, you may wonder “Why bother?”.
By wrapping the object method (anchor.get) in a function, we can now use
map, a one-line iterator that has lots of benefits.
Like a Python generator, map is lazily evaluated. This means you’re not filling
your memory with data you don’t need yet, or computing values you may not need.
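To see the laziness for yourself, here is a minimal sketch that doesn’t touch the network. The record_and_double function and the calls list are made up for this illustration; they just record when the work actually happens.

```python
calls = []

def record_and_double(x):
    # Record each call so we can see when the work really happens
    calls.append(x)
    return x * 2

doubled = map(record_and_double, [1, 2, 3])
calls_before = list(calls)     # map has not called the function yet

first = next(doubled)          # pulls exactly one value
calls_after_one = list(calls)

rest = list(doubled)           # consumes the remaining values

print(calls_before, first, calls_after_one, rest)
# [] 2 [1] [4, 6]
```

Nothing is computed until a value is asked for, which is why a map over a long list of anchors costs almost nothing up front.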
So now we don’t need to pre-declare our all_urls list, as we did near the
top of the original script, just after defining to_scrape. Our updated code is:
from bs4 import BeautifulSoup
from requests import get

to_scrape = ["http://www.example.com"]

def get_href(anchor):
    return anchor.get("href")

for url in to_scrape:
    r = get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, "lxml")
        page_as = soup.find_all('a')
        all_urls = map(get_href, page_as)

for url in all_urls:
    print(url)

# Output:
# http://www.iana.org/domains/example
Step 2: Handling the if Statement
REFACTORING THE body of the for loop, which gets the HTML and extracts the anchors, needs to be handled slightly differently because of the if statement in there. Our intention is to prevent BeautifulSoup from attempting to create a soup if we didn’t get the expected response, which could cause the program to crash or return invalid results.
We need to remove any results where the status_code is not 200. To do this,
we filter the results using filter, which is also lazily evaluated. The final
function we need makes a BeautifulSoup and extracts the page anchors. Mapping
it over the results gives one list of anchors per page, a 2D structure, and
we need a flat sequence, so we use chain.from_iterable from itertools to
flatten it.
from bs4 import BeautifulSoup
from itertools import chain
from requests import get

to_scrape = ["http://www.example.com"]

def get_href(anchor):
    return anchor.get("href")

def status_200(result):
    return result.status_code == 200

def get_anchors(result):
    soup = BeautifulSoup(result.content, "lxml")
    return soup.find_all('a')

results = map(get, to_scrape)
valid_results = filter(status_200, results)
page_as = map(get_anchors, valid_results)
flat_anchors = chain.from_iterable(page_as)
all_urls = map(get_href, flat_anchors)

for url in all_urls:
    print(url)

# Output:
# http://www.iana.org/domains/example
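If you want to try the filter and flatten steps without making any requests, the sketch below swaps the real responses for a made-up FakeResult tuple. FakeResult, its links field and get_links are assumptions for this illustration only; status_200 is the same check as in the scraper.

```python
from collections import namedtuple
from itertools import chain

# Hypothetical stand-in for a response: a status code and some links
FakeResult = namedtuple("FakeResult", ["status_code", "links"])

fake_results = [
    FakeResult(200, ["/a", "/b"]),
    FakeResult(404, ["/missing"]),
    FakeResult(200, ["/c"]),
]

def status_200(result):
    return result.status_code == 200

def get_links(result):
    return result.links

# Drop the failed result, then map each remaining result to its links
valid = filter(status_200, fake_results)
nested = map(get_links, valid)   # yields ['/a', '/b'] then ['/c']

# Flatten the nested lists into a single sequence
flat = list(chain.from_iterable(nested))
print(flat)
# ['/a', '/b', '/c']
```

The 404 result never reaches get_links, which is the same protection the if statement gave us in the loop version.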
Benefit 1: Function Reuse
WE’VE GOT three functions that we might find useful. We could decompose
get_anchors into two more functions, but we’ll work with what we’ve got. It’s
common to scrape a page for URLs, get the content at those URLs, and find all the
URLs on those pages.
We’ve declared all the functions we need to get the URLs from an initial page. Because we abstracted them from the iteration, we can now use them elsewhere. This will run if you add it to the existing code.
# Re-use the functions to get URLs
# from one page
result = get("http://www.example.com")
if status_200(result):
    anchors = get_anchors(result)
else:
    raise ValueError("No initial source")

to_scrape = map(get_href, anchors)
Benefit 2: Easy Parallelisation
ONCE WE’VE made a working application, we’ll probably want to parallelise parts of it, such as the get requests, to speed up our scraping. When working with imperative code, this is an intimidating task. By using functional style iteration, we’ve avoided mutating any data and we can add parallelisation with a single function.
To add parallelisation, we need a parallel version of map. Sadly there isn’t a builtin one, but the standard library’s multiprocessing.Pool lets us write our own easily and reuse it in lots of projects. Compare this to what you’d have to do to parallelise a list/generator comprehension. Aren’t you glad Guido relented and let us keep map?
from multiprocessing import Pool

def pmap(f, iterable):
    # Spread the calls to f across 4 worker processes
    with Pool(4) as pool:
        yield from pool.map(f, iterable)

# results = map(get, to_scrape)
results = pmap(get, to_scrape)
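Web requests spend most of their time waiting on the network, so a thread-based variant is a reasonable alternative to processes. This is a swapped-in sketch, not the pmap above: pmap_threads and its workers parameter are names made up here, and square stands in for get so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

def pmap_threads(f, iterable, workers=4):
    # Executor.map yields results in the same order as the inputs
    with ThreadPoolExecutor(max_workers=workers) as executor:
        yield from executor.map(f, iterable)

def square(x):
    return x * x

squares = list(pmap_threads(square, range(5)))
print(squares)
# [0, 1, 4, 9, 16]
```

Because it has the same shape as map, it should drop into the pipeline the same way: results = pmap_threads(get, to_scrape).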
A FUNCTIONAL approach to iteration, using map and filter,
has helped to abstract our functions from our iteration. I’ve described two
benefits that are easy to see and difficult to deny, but I also believe that
this code is easier to read and can be more memory efficient. If I
weren’t restricted to 40-character lines in this blog, I’d have chained the
maps and filters together without variable names between. I find this easier to
parse, but that might just be me.
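For the curious, here is roughly what that chained style looks like. To keep it runnable offline, the PAGES dictionary and the fetch, is_ok and get_hrefs helpers are made-up stand-ins for get, status_200 and the anchor extraction.

```python
from itertools import chain

# Hypothetical stand-in for the web:
# URL -> (status code, hrefs found on the page)
PAGES = {
    "http://a.test": (200, ["http://a.test/1", "http://a.test/2"]),
    "http://b.test": (500, []),
    "http://c.test": (200, ["http://c.test/1"]),
}

def fetch(url):
    return PAGES[url]

def is_ok(result):
    return result[0] == 200

def get_hrefs(result):
    return result[1]

# Read from the inside out: fetch, keep the good results,
# pull out the hrefs, then flatten
all_urls = chain.from_iterable(
    map(get_hrefs, filter(is_ok, map(fetch, PAGES)))
)

url_list = list(all_urls)
print(url_list)
# ['http://a.test/1', 'http://a.test/2', 'http://c.test/1']
```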
Finally, I’d be remiss if I didn’t mention reduce. It’s the third big
functional style tool for iteration, and you’ll find it in functools.
It’s used for iterating whilst combining the elements in some way. For
example, the inbuilt sum function can be written with reduce.
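As a sketch of that last point, here is a sum written with functools.reduce. The name my_sum is made up for this example, and the start argument mirrors the optional start value that the inbuilt sum accepts.

```python
from functools import reduce
from operator import add

def my_sum(numbers, start=0):
    # Fold the numbers together with +, beginning from start
    return reduce(add, numbers, start)

print(my_sum([1, 2, 3, 4]))
# 10
print(my_sum([]))
# 0
```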
Until next time, happy coding. Paul