Functional Iteration: Better than Pythonic
TIRED OF writing for loops yet? Let’s take some inspiration from functional programming to improve our Python programming, making use of the inbuilt functions.
To highlight the benefits of this approach, we’ll use a semi-realistic example based on a common use of Python: web scraping. We’ll start with a list of URLs to scrape, and we want to collect all the URLs linked from those pages.
Start With Pythonic
WE’LL MAKE use of a couple of popular libraries, BeautifulSoup and requests, to start from pythonic, imperative code.
I’ve avoided defining any functions of our own here to make the process of refactoring this into a functional paradigm simpler to follow.
from bs4 import BeautifulSoup
from requests import get

to_scrape = ["http://www.example.com"]

# Make a list to store our urls
all_urls = list()

# Iterate through urls in to_scrape
for url in to_scrape:
    # Get the HTML
    r = get(url)
    # Check we did get the HTML
    if r.status_code == 200:
        # Make a soup to find the anchors
        soup = BeautifulSoup(r.content, "lxml")
        page_as = soup.find_all('a')
        # Iterate through the anchors
        for anchor in page_as:
            href = anchor.get('href')
            # Store the URL
            all_urls.append(href)

for url in all_urls:
    print(url)
# Output:
# http://www.iana.org/domains/example
Step 1: Abstract the Function
THE FIRST step is to abstract the function from the iteration. In the inner loop, our intention is to extract the href value, which is the URL, from each anchor.
In the end, we won’t need to worry about storing the href in all_urls, so we’ll skip that too.
def get_href(anchor):
    return anchor.get('href')

for anchor in page_as:
    href = get_href(anchor)
This change looks so trivial that you may wonder, “Why bother?”
By changing the object method (anchor.get) into a function, we can now use map, a one-line iterator that has lots of benefits.
Like a Python generator, map is lazily evaluated. This means you’re not filling your memory with data you don’t need yet, or computing values you may never need.
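To see that laziness in action, here’s a tiny sketch of my own (the print is only there as a visible side effect):

def shout(word):
    print(f"processing {word}")
    return word.upper()

lazy = map(shout, ["a", "b", "c"])
# Nothing has been printed yet; no work happens until we consume the map
first = next(lazy)
# Only now does "processing a" appear; "b" and "c" remain untouched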
So now, we don’t need to pre-declare our all_urls list, which we did near the top of the original version. So our updated code is:
from bs4 import BeautifulSoup
from requests import get

to_scrape = ["http://www.example.com"]

def get_href(anchor):
    return anchor.get("href")

for url in to_scrape:
    r = get(url)
    if r.status_code == 200:
        soup = BeautifulSoup(r.content, "lxml")
        page_as = soup.find_all('a')
        all_urls = map(get_href, page_as)

for url in all_urls:
    print(url)
# Output:
# http://www.iana.org/domains/example
Step 2: Handling if
REFACTORING THE rest of the loop, which gets the HTML and extracts the anchors, needs to be handled slightly differently because of the if statement in there. Our intention is to prevent BeautifulSoup from attempting to create a soup if we didn’t get the expected response, which could cause the program to crash or return invalid results.
We need to remove any results where the status_code is not 200. To do this we will filter the results using filter, which is also lazily evaluated. The final function required needs to make a BeautifulSoup and extract the page anchors. This results in a nested sequence (a list of anchors per page) when we need a flat one, so we use a function from itertools to flatten it.
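As a quick aside, here’s what that flattening function, chain.from_iterable, does with a nested iterable (the lists below are just stand-ins for the per-page anchors):

from itertools import chain

pages = [["a1", "a2"], ["b1"], ["c1", "c2"]]
flat = chain.from_iterable(pages)
print(list(flat))
# ['a1', 'a2', 'b1', 'c1', 'c2']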
from bs4 import BeautifulSoup
from itertools import chain
from requests import get

to_scrape = ["http://www.example.com"]

def get_href(anchor):
    return anchor.get("href")

def status_200(result):
    return result.status_code == 200

def get_anchors(result):
    soup = BeautifulSoup(result.content, "lxml")
    return soup.find_all('a')

results = map(get, to_scrape)
valid_results = filter(status_200, results)
page_as = map(get_anchors, valid_results)
flat_anchors = chain.from_iterable(page_as)
all_urls = map(get_href, flat_anchors)

for url in all_urls:
    print(url)
# Output:
# http://www.iana.org/domains/example
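It’s worth pausing on how lazy this version is: because map, filter and chain.from_iterable are all lazily evaluated, no requests are actually made and no soups are built until the final for loop consumes all_urls.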
Benefit 1: Function Reuse
WE’VE GOT three functions that we might find useful. We could decompose get_anchors into two more functions, but we’ll work with what we’ve got. It’s common to scrape a page for URLs, get the content at those URLs, and find all the URLs on those pages.
We’ve declared all the functions we need to get the URLs from an initial page, and by abstracting them from the iteration, we can now use them elsewhere. This will run if you add it into the existing code.
# Re-use the functions to get URLs
# from one page
result = get("http://www.example.com")

if status_200(result):
    anchors = get_anchors(result)
else:
    raise ValueError("No initial source")

to_scrape = map(get_href, anchors)
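To make the reuse more concrete, here’s a sketch of how the same functions could drive a whole level of scraping at once. scrape_level is a hypothetical helper of my own, not something from the code above, and a real scraper would want to drop relative or missing hrefs first:

def scrape_level(urls):
    # One level of scraping: fetch each URL, keep the good
    # responses, and pull out every linked URL
    results = map(get, urls)
    valid_results = filter(status_200, results)
    page_as = map(get_anchors, valid_results)
    flat_anchors = chain.from_iterable(page_as)
    return map(get_href, flat_anchors)

first_level = scrape_level(to_scrape)
second_level = scrape_level(first_level)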
Benefit 2: Easy Parallelisation
ONCE WE’VE made a working application, we’ll probably want to parallelise parts of it, such as the get requests, to speed up our scraping. When working with imperative code, this is an intimidating task. By using functional-style iteration, we’ve avoided mutating any data, and we can add parallelisation with a single function.
To add parallelisation, we need a parallel version of map. Sadly one isn’t built into Python, but we can write our own easily and reuse it in lots of projects. Compare this to what you’d have to do to parallelise a list/generator comprehension. Aren’t you glad Guido relented and let us keep map?
from multiprocessing import Pool

def pmap(f, iterable):
    # Map f over the iterable using a pool of worker processes
    with Pool(4) as pool:
        yield from pool.map(f, iterable)

# results = map(get, to_scrape)
results = pmap(get, to_scrape)
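A quick design note of my own, not from the original code: since the get requests are IO-bound rather than CPU-bound, a thread-backed pool would arguably be a better fit than processes. The standard library’s multiprocessing.dummy exposes the same Pool interface over threads, so the sketch barely changes:

from multiprocessing.dummy import Pool as ThreadPool

def pmap_threads(f, iterable):
    # Same shape as pmap, but workers are threads,
    # which suits waiting on network responses
    with ThreadPool(4) as pool:
        yield from pool.map(f, iterable)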
Conclusion
A FUNCTIONAL approach to iteration, using map and filter, has helped to abstract our functions from our iteration. I’ve described two benefits that are easy to see and difficult to deny, but I also believe that this code is easier to read and can be more memory efficient. If I weren’t restricted to 40-character lines in this blog, I’d have chained the maps and filters together without variable names in between. I find this easier to parse, but that might just be me.
Finally, I’d be remiss if I didn’t mention reduce; it’s the third big functional-style tool for iteration, and you’ll find it in functools. It’s used for iterating whilst combining the elements in some way; for example, the inbuilt sum function can be written with reduce.
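Here’s a small sketch of that, with my own my_sum standing in for the real inbuilt:

from functools import reduce

def my_sum(iterable):
    # Combine the elements pairwise: ((0 + 1) + 2) + 3 ...
    return reduce(lambda total, x: total + x, iterable, 0)

print(my_sum([1, 2, 3, 4]))
# Output:
# 10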
Until next time, happy coding. Paul