
How to Bulk Download Images from DuckDuckGo (for Deep Learning)

Published: 2024-03-01

TL;DR

I ended up screen scraping the results page in my browser. However, if you prefer Python and want full automation, you can use the search_images_ddg() function from fastai's fastbook library.

You can also try my version of search_images_ddg() written in PHP.

Some Background

Recently, I needed a large number of images of dogs sitting on couches. I use DuckDuckGo as my default search engine, so I wondered if I could use their API to download image search results. Ideally, I would receive a JSON array containing URLs of images corresponding to my search query.

Unfortunately, DuckDuckGo does not seem to have any documentation for its API. At the time of this writing, the DuckDuckGo Help Pages only return two results when searching for “API”: https://duckduckgo.com/duckduckgo-help-pages/search/?q=API, and they do not appear to be related to any actual APIs.

DuckDuckGo Help Pages Screenshot

Additionally, this Stack Overflow post mentions that DuckDuckGo provides only a very limited API, largely because of the way they generate results: DuckDuckGo sources its results from Bing rather than producing them itself.

Using fastai’s search_images_ddg() function (Python)

fastai's fastbook library, which accompanies the Deep Learning for Coders book (also available on Amazon), includes a function called search_images_ddg().

I couldn’t find any documentation for this function in fastai’s docs, so I looked directly at the source to figure out how they use DuckDuckGo without an API. Here is the source code:

# Imports needed to run this standalone (urlread and L come from fastcore,
# a fastai dependency; fastbook's __init__.py normally provides them):
import re, json, time
from urllib.error import URLError, HTTPError
from fastcore.all import L, urlread

def search_images_ddg(term, max_images=200):
    "Search for `term` with DuckDuckGo and return unique URLs of about `max_images` images"
    assert max_images<1000
    url = 'https://duckduckgo.com/'
    res = urlread(url,data={'q':term})
    searchObj = re.search(r'vqd=([\d-]+)\&', res)
    assert searchObj
    requestUrl = url + 'i.js'
    params = dict(l='us-en', o='json', q=term, vqd=searchObj.group(1), f=',,,', p='1', v7exp='a')
    urls,data = set(),{'next':1}
    headers = dict(referer='https://duckduckgo.com/')
    while len(urls)<max_images and 'next' in data:
        try:
            res = urlread(requestUrl, data=params, headers=headers)
            data = json.loads(res) if res else {}
            urls.update(L(data['results']).itemgot('image'))
            requestUrl = url + data['next']
        except (URLError,HTTPError): pass
        time.sleep(1)
    return L(urls)[:max_images]

The source of the function is in the __init__.py file from the library. If you would like to inspect the source yourself, you can download it from here.

After initial inspection, it looks like they are using some type of internal API after capturing a state parameter from a request made to the search engine. Here is an overview of the algorithm:

  1. Make a request to https://duckduckgo.com/?q=searchTerm.
  2. Use regex to search the response and find the value of a query parameter named vqd.
  3. Use the vqd and other parameters (for example o, f, l, next, …) as query string parameters to make a request to https://duckduckgo.com/i.js.
  4. The above returns an actual JSON string. The contents of the response look like this: Screenshot of a formatted JSON string This is great and usable! You get a thumbnail URL (from Bing) and a full-sized image URL.
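The token-extraction part of step 2 can be sketched in isolation. The HTML fragment and the token value below are made up for illustration; the real response from https://duckduckgo.com/?q=term embeds a vqd token somewhere in its markup, and fastbook captures it with the same regex:

```python
import re

# Hypothetical HTML fragment standing in for the real search-page response
# (the vqd value here is invented, not a real token).
sample_html = "nrj('/d.js?q=dog',{});vqd=4-123456789012345&kl=us-en"

# The same regex fastbook uses to capture the token:
match = re.search(r'vqd=([\d-]+)\&', sample_html)
vqd = match.group(1) if match else None
print(vqd)  # -> 4-123456789012345
```

The captured value is then passed along as the vqd query parameter in the request to https://duckduckgo.com/i.js.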

By default search_images_ddg() returns the full-sized image URLs as a list. After you have the URLs, you can use another fastai function called download_images() and save the files to your machine. It is available via the fastai.vision.utils module.
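If you'd rather not pull in fastai just for the download step, a minimal stdlib sketch of the same idea could look like this. It is sequential (unlike download_images(), which uses parallel workers), and the download_all() name and result_img_<n>.jpg naming scheme are my own choices, not part of any library:

```python
from pathlib import Path
from urllib.request import urlretrieve

def download_all(urls, dest="images", start_index=1):
    """Download each URL into `dest` as result_img_<n>.jpg.

    A minimal stdlib sketch, not fastai's implementation: broken links
    are skipped instead of aborting the whole run.
    """
    Path(dest).mkdir(parents=True, exist_ok=True)
    saved = []
    for i, url in enumerate(urls, start=start_index):
        out = Path(dest) / f"result_img_{i}.jpg"
        try:
            urlretrieve(url, out)
            saved.append(out)
        except OSError:
            pass  # broken link: skip it and keep going
    return saved
```

For a real dataset you would pass in the list returned by search_images_ddg(); fastai's download_images() remains the more convenient choice if you already have fastai installed.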

Implementing my own version of search_images_ddg() in PHP

In order to deepen my understanding of the search_images_ddg() function, I decided to implement a version of it in PHP and to publish the result as a Composer library.

Here is the final result https://github.com/yanosh-k/duckduckgo.

I think every developer should try creating this type of MVP library and publishing it to the world. So if you feel inclined, please do create a flavor in your preferred language.

Getting URLs directly from my browser (In Browser Javascript)

I was wondering whether I could find a way to use my browser to scrape the results page and get a simple JSON array that I can plug in anywhere. This would allow me to visually skim over the images before downloading any of them.

And because I need the images for deep learning (where low resolution is fine), the method described below downloads the “previews” of the images (the images shown when you click a result on the DuckDuckGo image search page). It also has the added benefit of always using JPEG format, and you never have to deal with files that can’t be downloaded because of broken links.

This is how it’s done:

  1. Start an image search in your browser. For example https://duckduckgo.com/?t=ffab&q=dog+on+couche&iax=images&ia=images&iaf=type%3Aphoto
  2. Scroll to the bottom of the page. This will cause the browser to load all images in the DOM.
  3. Open your DevTools. And execute the following snippet in the console:
    // Note: in DevTools, $() returns only the first match, so use
    // document.querySelectorAll() to collect every loaded thumbnail.
    var imgs = document.querySelectorAll('.tile--img__img.js-lazyload');
    var imgsSrcs = [];
    for (var i = 0; i < imgs.length; i++) {
        imgsSrcs.push(imgs[i].src);
    }
    imgsSrcs;
    
  4. Right click on the output line of the list of URLs and select Copy Object. This will store a JSON array of URLs in your clipboard.
  5. Save this JSON to a file. For example my_cool_results.json
  6. Use wget or any other tool to download the images. For example, I may want to save all the images under the same name but with an incrementing index (you can replace $num with $((149+$num)) if you want the indexing to start from 150):
    cat my_cool_results.json | jq '.[]' | tr -d '"' |  cat -n | while read num url; do wget -O "result_img_$num.jpg" "$url"; done
    

If you want to know what each of the commands from the above bash script does, you can use a tool called explainshell.com. The explainshell for the above code is here.
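The same pipeline can also be sketched in Python if you prefer to stay in one language. The download_from_json() name is a hypothetical helper of mine, and start_index=150 reproduces the $((149+$num)) renumbering trick from the bash version:

```python
import json
from pathlib import Path
from urllib.request import urlretrieve

def download_from_json(json_path, dest=".", start_index=1):
    """Read the JSON array of URLs saved in step 5 and store each
    image as result_img_<n>.jpg, numbering from `start_index`."""
    urls = json.loads(Path(json_path).read_text())
    Path(dest).mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls, start=start_index):
        urlretrieve(url, Path(dest) / f"result_img_{i}.jpg")
```

Called as download_from_json("my_cool_results.json", dest="dogs", start_index=150), it mirrors the jq/wget one-liner above.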