Monday, May 2, 2016

Download All Images from a WordPress.com Media Library

After a few months of blogging photos from my life events, I realised that I should have kept local copies (silly me). I keep everything else, so why did I forget this time?!
Anyway, I found out that this is not an easy job if you are on free hosting at WordPress.com. All I could get was either an XML export with the posts and references to the images, or a full migration to another WordPress blog.
So I rolled up my sleeves and reused a Python script I had used before.

Here is the script:
#!/usr/bin/python
# -*- coding: utf-8 -*-

# This script parses a WordPress backup XML file and downloads the media files it references.
# It targets Python 2 (it relies on urllib2 and the print statement).
# Example:
# python wordpress-media-downloader.py  /Users/madly/Downloads/mycoolblog.wordpress.2016-05-02.xml /Users/madly/Downloads/media-library

import sys
import urllib2
import re

#
# Takes a url and a directory for saving the file. Directory must exist.
#
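def download(url, dir_name):
    file_name = url.split('/')[-1]
    u = urllib2.urlopen(url)
    # Content-Length is optional, so fall back to 0 (unknown) when it is missing.
    lengths = u.info().getheaders("Content-Length")
    file_size = int(lengths[0]) if lengths else 0
    print "Downloading File: %s (Size: %s Bytes)" % (file_name, file_size)

    file_size_dl = 0
    block_sz = 8192
    # The with-block closes the file even if the download fails midway.
    with open(dir_name+'/'+file_name, 'wb') as f:
        while True:
            chunk = u.read(block_sz)
            if not chunk:
                break

            file_size_dl += len(chunk)
            f.write(chunk)
            if file_size:
                status = r"%10d  [%3.2f%%]" % (file_size_dl, file_size_dl * 100. / file_size)
            else:
                # No known total size, so show only the byte count.
                status = r"%10d" % file_size_dl
            # Backspaces move the cursor back so the next status overwrites this one.
            status = status + chr(8)*(len(status)+1)
            print status,

    u.close()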

#
# Take file name and directory parameters from the user call
#
file_name = sys.argv[1]
dir_name = sys.argv[2]
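# Accept http and https, and stop at whitespace, quotes or angle brackets so that
# one match cannot swallow several URLs that share a line.
img_regex = r'(https?://[^\s"\'<>]*files\.wordpress\.com[^\s"\'<>]*\.(?:jpeg|jpg|png))'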

#
# Get the images urls from the xml file
#
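urls_to_download = []

# The with-block closes the XML file once it has been read.
with open(file_name, "r") as xml_file:
    for line in xml_file:
        # findall returns every captured URL on the line, not only the first one.
        for img_url in re.findall(img_regex, line):
            # The same image can be referenced several times; queue it once.
            if img_url not in urls_to_download:
                urls_to_download.append(img_url)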

urls_count = len(urls_to_download)
print("Images to download: %s images." % (urls_count))

#
# Download images
#
i = 0
for url in urls_to_download:
    i += 1
    print("File %s/%s" % (i, urls_count))
    download(url, dir_name)
    # print(url) # use this line instead of download() if you want to export the output to a download manager

I simply call the script and pass it the backup XML file I've just downloaded from my WordPress.com admin panel (/wp-admin), followed by the path to the folder where I want to save the images. The folder must already exist.



One further improvement would be to parse the XML properly and use the wp:post_id element to sort the images into folders by post. But I am satisfied with the current result. I can do the rest manually, as that will be faster in my case :)
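
For anyone who does want to try the wp:post_id idea, here is a rough sketch of how the grouping could look. It is not part of the script above and has not been run against a real export. It is written for Python 3, unlike the Python 2 script, and the names group_images_by_post and local_name are made up for this illustration. It assumes that the export declares its XML namespaces, that every post is an item element, and that the post id sits in a wp:post_id child of that item. If media entries are exported as items of their own, their images will end up under the media entry's id rather than the post's.

```python
import re
import xml.etree.ElementTree as ET

# Same idea as img_regex in the script: image URLs hosted on files.wordpress.com.
IMG_REGEX = re.compile(
    r'https?://[^\s"\'<>]*files\.wordpress\.com[^\s"\'<>]*\.(?:jpeg|jpg|png)')


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit('}', 1)[-1]


def group_images_by_post(xml_text):
    """Return a dict that maps each post id to the image URLs found in that item."""
    root = ET.fromstring(xml_text)
    groups = {}
    for item in root.iter('item'):
        post_id = None
        urls = []
        for child in item:
            text = child.text or ''
            # Assumption: the id lives in a child whose local name is post_id.
            if local_name(child.tag) == 'post_id':
                post_id = text.strip()
            for url in IMG_REGEX.findall(text):
                if url not in urls:  # keep each URL once per item
                    urls.append(url)
        # Items without an id or without images are skipped.
        if post_id and urls:
            groups[post_id] = urls
    return groups
```

Each key of the returned dict could then become a sub-folder of the target directory (for example dir_name/<post id>), with the URLs in its list downloaded into it.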
