
Downloading HTTP resources in Python using the requests library

Downloading resources in Python can be a tedious job unless you find the right library to help. If you just want to grab a plain URL, the standard library's urllib2 is sufficient, but if you need any extras such as basic authentication, keep-alives, or connection pooling, there are better alternatives available.

The requests module is a comprehensive, 'pythonic' library for dealing with HTTP. It's essentially a high-level wrapper over the functionality included in both urllib2 and urllib3, but with a few very useful extras. Its clean, hassle-free approach is best highlighted with some examples.

Get

Get a URL and print the HTTP response code. All the standard response codes are available as built-in lookups, which keeps the code free of magic-number comparisons such as == 200 or != 404.

import requests

r = requests.get('http://pipe-devnull.com/static/samples/test1.txt')
print r.status_code
# 200

if r.status_code == requests.codes.ok:
    print "request ok!"

print r.text
# prints the body of the response
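
If you prefer exceptions over explicit status checks, the response object also provides a raise_for_status() helper that raises an HTTPError for 4xx/5xx responses. A minimal sketch:

import requests

r = requests.get('http://pipe-devnull.com/static/samples/test1.txt')
try:
    # no-op on success, raises requests.exceptions.HTTPError for 4xx/5xx responses
    r.raise_for_status()
except requests.exceptions.HTTPError as e:
    print(e)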

Raise a timeout exception if the server takes longer than half a second to respond (hopefully this site will respond in well under half a second).

r = requests.get("http://pipe-devnull.com/static/samples/test1.txt", timeout=0.5)
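
If you want to handle the timeout yourself rather than letting the exception propagate, catch requests.exceptions.Timeout. A small sketch:

import requests

try:
    r = requests.get("http://pipe-devnull.com/static/samples/test1.txt", timeout=0.5)
    print(r.status_code)
except requests.exceptions.Timeout:
    # the server didn't respond within 0.5 seconds
    print("request timed out")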

All headers are easily accessible in a dictionary in the response object.

print r.headers["content-type"]
# outputs 'text/html'
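
The headers mapping is case-insensitive, so the capitalisation of the key doesn't matter, and you can iterate over it like a normal dict. A quick sketch:

import requests

r = requests.get("http://pipe-devnull.com/static/samples/test1.txt")
print(r.headers.get("Content-Type"))  # same value as r.headers["content-type"]
for name, value in r.headers.items():
    print(name + ": " + value)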

By default, SSL certificate verification is enabled, so an exception will automatically be raised if there is a hostname mismatch or an untrusted certificate. Although it would be nice to always develop against real certs, we often have to make do with self-signed ones. You can avoid these exceptions during development by setting verify to False (don't let this slip into production though).

requests.get("https://selfsigned.example.com", verify=False);

Authentication

Authentication is quick and easy with the requests library.

Basic Authentication

This is the most common form, and adding it is simple:

r = requests.get("https://api.github.com", auth=("user", "pass"))

Want to see some equivalent code using the standard urllib2 library? urllib2 vs requests - enough said.

Digest Authentication

import requests
from requests.auth import HTTPDigestAuth

# basic auth is the default scheme; pass HTTPDigestAuth explicitly for digest
r = requests.get("http://example.com/private", auth=HTTPDigestAuth("user", "pass"))

Keep-Alives

Requests enables keep-alives by default when you create a session. This is a common performance booster when scripting repetitive download/fetch routines, because it avoids the expensive TCP connection setup on every request.

# Set up a connection pool and use keep-alives (on by default within a session)
s = requests.session()
s.config["keep_alive"] = True  # or False to disable; note the config dict was removed in requests 1.0
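
As a sketch of why this helps, every request made through the same session reuses the pooled, kept-alive connection (the URL is just the sample file used above):

import requests

s = requests.session()
# all three requests travel over the same underlying TCP connection
for _ in range(3):
    r = s.get("http://pipe-devnull.com/static/samples/test1.txt")
    print(r.status_code)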

The library also used to support parallel asynchronous file downloads, but that feature has since been removed and ported over to a separate module called grequests.
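
A rough sketch of what that looks like with grequests (it relies on gevent, and the duplicated sample URLs here are purely illustrative):

import grequests

urls = [
    "http://pipe-devnull.com/static/samples/test1.txt",
    "http://pipe-devnull.com/static/samples/test1.txt",
]
# build unsent requests, then fire them off concurrently
reqs = (grequests.get(u) for u in urls)
for r in grequests.map(reqs):
    print(r.status_code)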

The library has many other features to help with cookie management, setting POST data, request streaming, etc., which you can read much more about in the excellent official documentation.
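
For instance, sending form-encoded POST data and streaming a large download to disk look roughly like this (the URLs and filename are illustrative only):

import requests

# form-encoded POST data
r = requests.post("http://example.com/login", data={"user": "bob", "pass": "secret"})

# stream a large response to disk instead of reading it all into memory
r = requests.get("http://example.com/big.iso", stream=True)
with open("big.iso", "wb") as f:
    for chunk in r.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)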
