Blog: PHP, Python, Linux, Web services & Continuous delivery

A multi-threaded downloader script with URL queuing

I wrote this script to try out the standard queue and threading libraries in Python. It doesn't do anything exciting, but it can be quite useful if you have a bunch of files to download. It accepts a comma-separated list of URLs as a command line option, a path to a JSON file, or both; the JSON format is outlined a bit further down.

The Python documentation covers both the queue library and the threading library. The script also requires the requests library to fetch the resources (see the earlier post, Downloading HTTP resources in Python using the requests library).

In a future post I'll add an example of how to write a basic web crawler / spider using Scrapy that outputs its results to a JSON file, which can then be fed into this downloader script.

Script overview:

The script contains a Downloader class, a DownloadManager class and a main entry function that parses the CLI arguments and any JSON input file. The DownloadManager class has two basic responsibilities:

  • Create a queue of URLs that are to be downloaded
  • Create a thread pool of n threads and register the queue with each thread
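
As a rough illustration of those two responsibilities, here is a minimal Python 3 sketch of the queue-plus-thread-pool pattern. It is not the script itself: `run_pool` and `handler` are hypothetical names, and `handler` stands in for whatever does the actual download.

```python
import queue
import threading


def run_pool(urls, handler, thread_count=5):
    """Feed every URL to `handler` using a pool of daemon worker threads."""
    url_queue = queue.Queue()

    def worker():
        while True:
            url = url_queue.get()          # blocks until a URL is available
            try:
                handler(url)
            except Exception as exc:       # keep the worker alive on failure
                print("failed:", url, exc)
            finally:
                url_queue.task_done()      # always mark the job as finished

    # Create the pool; every worker reads from the same queue
    for _ in range(thread_count):
        threading.Thread(target=worker, daemon=True).start()

    # Queue the work, then block until every item has been marked done
    for url in urls:
        url_queue.put(url)
    url_queue.join()


seen = []
run_pool(["http://example.com/a", "http://example.com/b"], seen.append, thread_count=2)
print(sorted(seen))
```

Because the workers are daemon threads sitting in an endless loop, it is `url_queue.join()`, not joining the threads, that tells the caller the work is finished.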

That leaves the Downloader responsible for:

  • Fetching a URL from the queue and downloading it using the requests library
  • Writing the fetched resource to the local file system
  • Timing the download
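
Those three steps can be sketched in isolation, assuming Python 3. The names `local_path_for`, `timed_save` and `fetch` are made up for this example, the `index.html` fallback is my own assumption, and `fetch` is any callable that returns the response body as bytes, so the sketch runs without touching the network.

```python
import os
import tempfile
import time
from urllib.parse import urlparse


def local_path_for(url, output_directory):
    """Name the local file after the last path segment of the URL."""
    # Assumption: fall back to index.html when the URL path has no file name
    filename = os.path.basename(urlparse(url).path) or "index.html"
    return os.path.join(output_directory, filename)


def timed_save(url, output_directory, fetch):
    """Fetch `url`, write the bytes to disk and return (path, seconds)."""
    started = time.perf_counter()          # wall-clock timer
    content = fetch(url)
    elapsed = time.perf_counter() - started

    path = local_path_for(url, output_directory)
    with open(path, "wb") as handle:
        handle.write(content)
    return path, elapsed


demo_dir = tempfile.mkdtemp()
path, seconds = timed_save("http://example.com/files/song.mp3", demo_dir,
                           lambda url: b"fake bytes")
print(path, "written after a %.4f second fetch" % seconds)
```

With the requests library installed, something like `lambda url: requests.get(url).content` could be passed as `fetch` instead of the stand-in used here.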

Sample usage:

pydownload.py ./output_directory/ -i <JSONinputfile>  -f <url1,url2,url3...>
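
One thing to watch with this argument order: the standard library's `getopt.getopt` stops parsing at the first non-option argument, so any options placed after the output directory are treated as positional arguments. `getopt.gnu_getopt` lets options and positional arguments be mixed. A small Python 3 sketch of the difference (the argument values are made up for the example):

```python
import getopt

argv = ["./output_directory/", "-i", "download-items.json",
        "-f", "http://example.com/a,http://example.com/b"]

# Plain getopt: the leading directory ends option parsing immediately
opts, args = getopt.getopt(argv, "hi:f:", ["ifile=", "flist="])
print(opts)      # no options recognised
print(args)      # everything is left over as positional arguments

# gnu_getopt: options are recognised wherever they appear
gnu_opts, gnu_args = getopt.gnu_getopt(argv, "hi:f:", ["ifile=", "flist="])
print(gnu_opts)
print(gnu_args)
```

So a parser built on plain `getopt.getopt` needs the options to come before the output directory, while one built on `gnu_getopt` accepts either order.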

Sample Output:

----------pydownnload---------------

JSON file:             download-items.json
Output Directory:      ./test/
File list:             None

* Thread 39541d3a6cc3851db70d6b220264a3ce - processing URL
* Thread 51c43adfd6228462fe8548352cbc682b - processing URL
* Thread: caf11370ca87b10c1b518626e5ee3b80 Downloaded http://pipe-devnull.com/static/samples/download/Faith_no_More_King_For_A_Day.mp3 in 2.94 seconds
* Thread: 39541d3a6cc3851db70d6b220264a3ce Downloaded http://pipe-devnull.com/static/samples/download/John_Frusciante_Walls_and_Doors.mp3 in 3.33 seconds

If you want to pass a JSON file to the script, it must conform to the following format:

[
{"link_name": "Faith_no_More_King_For_A_Day.mp3", "link_address": "http://pipe-devnull.com/static/samples/download/Faith_no_More_King_For_A_Day.mp3"},
{"link_name": "John_Frusciante_Walls_and_Doors.mp3", "link_address": "http://pipe-devnull.com/static/samples/download/John_Frusciante_Walls_and_Doors.mp3"}
]
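
For illustration, here is a minimal Python 3 sketch of turning a file in this format into a name-to-URL dictionary. `load_download_dict` is a hypothetical helper and the example.com URLs are placeholders. It also shows that Python's json module is strict: a trailing comma after the last object is rejected.

```python
import json


def load_download_dict(json_text):
    """Map each link_name to its link_address."""
    download_dict = {}
    for entry in json.loads(json_text):
        download_dict[entry["link_name"]] = entry["link_address"]
    return download_dict


sample = """[
{"link_name": "a.mp3", "link_address": "http://example.com/a.mp3"},
{"link_name": "b.mp3", "link_address": "http://example.com/b.mp3"}
]"""
downloads = load_download_dict(sample)
print(downloads)

# A trailing comma after the last object is not valid JSON
try:
    load_download_dict('[{"link_name": "a", "link_address": "b"},]')
    trailing_comma_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    trailing_comma_ok = False
print(trailing_comma_ok)
```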

The Code:

# Multi-purpose downloader script (Python 3)
#
# - requests library for HTTP operations
# - standard library's queue module
# - standard library's threading module

import getopt
import json
import os
import queue
import sys
import threading
import time

import requests


# Downloader class - reads the queue and downloads each file in succession
class Downloader(threading.Thread):
    """Threaded File Downloader"""

    def __init__(self, url_queue, output_directory):
        # Give each thread a random 32-character hex name for the log output
        threading.Thread.__init__(self, name=os.urandom(16).hex())
        self.queue = url_queue
        self.output_directory = output_directory

    def run(self):
        while True:
            # Get the next URL from the queue (blocks until one is available)
            url = self.queue.get()

            # Download the file. A failed download must not kill the thread
            # or skip task_done(), otherwise queue.join() would never return
            print("* Thread " + self.name + " - processing URL")
            try:
                self.download_file(url)
            except (requests.RequestException, IOError) as e:
                print("* Thread: " + self.name + " Failed: " + url + " (" + str(e) + ")")
            finally:
                # Send a signal to the queue that the job is done
                self.queue.task_done()

    def download_file(self, url):
        # time.time() gives wall-clock time, which is what a download takes
        t_start = time.time()

        r = requests.get(url)
        if r.status_code == requests.codes.ok:
            t_elapsed = time.time() - t_start
            print("* Thread: " + self.name + " Downloaded " + url + " in " + "%.2f" % t_elapsed + " seconds")
            fname = os.path.join(self.output_directory, os.path.basename(url))

            with open(fname, "wb") as f:
                f.write(r.content)
        else:
            print("* Thread: " + self.name + " Bad URL: " + url)


# Spawns downloader threads and manages the URL download queue
class DownloadManager:

    def __init__(self, download_dict, output_directory, thread_count=5):
        self.thread_count = thread_count
        self.download_dict = download_dict
        self.output_directory = output_directory

    # Start the downloader threads, fill the queue with the URLs and
    # then feed the threads URLs via the queue
    def begin_downloads(self):
        url_queue = queue.Queue()

        # Create a thread pool and give each thread the queue
        for i in range(self.thread_count):
            t = Downloader(url_queue, self.output_directory)
            t.daemon = True
            t.start()

        # Load the queue from the download dict
        for linkname in self.download_dict:
            url_queue.put(self.download_dict[linkname])

        # Wait for the queue to finish
        url_queue.join()


# Main. Parse CLI options, prepare download list & start downloading
def main(argv):
    inputfile = None
    flist = None
    usage = 'pydownload.py ./output_directory/ -i <JSONinputfile>  -f <url1,url2,url3...>'
    try:
        # gnu_getopt accepts options after the positional output directory;
        # plain getopt stops parsing at the first non-option argument
        opts, args = getopt.gnu_getopt(argv, "hi:f:", ["ifile=", "flist="])
    except getopt.GetoptError:
        print(usage)
        sys.exit(2)

    # Check for required script argument output dir
    if len(args) > 0:
        output_directory = args[0]
    else:
        print(usage)
        sys.exit(2)

    for opt, arg in opts:
        if opt == '-h':
            print(usage)
            sys.exit(0)
        elif opt in ("-i", "--ifile"):
            inputfile = arg
        elif opt in ("-f", "--flist"):
            flist = arg.split(',')

    print('----------pydownnload---------------')
    print('------------------------------------')
    print('JSON file:             ', inputfile)
    print('Output Directory:      ', output_directory)
    print('File list:             ', flist)
    print('------------------------------------')

    # Now build a dict of URLs to download
    download_dict = {}

    # If the input file is supplied then parse it as JSON and add to dict of URLs
    if inputfile is not None:
        with open(inputfile) as fp:
            url_list = json.load(fp)
        for url in url_list:
            download_dict[url['link_name']] = url['link_address']

    # Add in any additional files contained in the flist variable
    if flist is not None:
        for f in flist:
            download_dict[f] = f

    # If there are no URLs to download then exit now, nothing to do!
    if not download_dict:
        print("* No URLs to download, got the usage right?")
        print("USAGE: " + usage)
        sys.exit(2)

    download_manager = DownloadManager(download_dict, output_directory, 5)
    download_manager.begin_downloads()


# Kick off
if __name__ == "__main__":
    main(sys.argv[1:])