Today I overcame a sizeable hurdle in developing my web application for researching Amazon's recommendation system. Apologies if this post is too technical; I barely understand what I've been doing myself – certainly not enough to put it in plain English.

For the last week, I have been attempting to make it so that submitting the Django form below causes the server to log into Amazon and search for the listed items using the scraping/browser-automation library I wrote called snu-snu. This is by no means a flashy front-end but it means I can test the basic functionality of what I'm building. I had partial success with this working directly on the computer I'm using for development. However, I couldn't be certain that what would work on one machine would work on Amazon's or any other servers.

The solution I was advised to use was to wrap everything using Docker so that the environment in which I deployed my application would be a carbon copy of my development environment. Based on my limited understanding, Docker is a tool that systematically builds individual instances of Linux (called containers) for different parts of a project. For example, one might have separate containers for web servers such as Nginx or Apache, databases such as MySQL or PostgreSQL and content management systems like WordPress or Joomla. These containers aren't as isolated as virtual machines (they aren't each allocated chunks of system resources, for example) but still only interact with each other inasmuch as is required. What makes them very useful is that their content and configuration are defined in human-readable Dockerfiles that can be deployed anywhere and produce identical containers when built.

The Docker setup I'm using combines the following:

  • Postgres: a database.

  • Celery: a system that enables tasks to be queued and scheduled so that they don't interfere with a user's interaction with a website. The user can continue using the site while time-consuming code is executed in the background.

  • RabbitMQ: a message broker used by Celery to schedule and manage tasks.

  • Django: a Python-based web application framework.

  • Nginx: a web server.

I needed to make it so that a task triggered by a user's interaction with Django would be queued in Celery and then carried out by snu-snu using Selenium and a headless browser (i.e. one configured not to output to a screen). This last part I had to figure out myself, or so I thought.
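To make the queueing idea concrete, here is a minimal sketch of the pattern using only Python's standard library. It is not Celery or RabbitMQ: the queue stands in for the broker, the thread stands in for a Celery worker, and `slow_search` and `handle_form_submission` are hypothetical names invented for illustration (the real work would be done by snu-snu).

```python
import queue
import threading
import time

# Stand-in for the broker (RabbitMQ in the real setup): holds pending jobs.
task_queue = queue.Queue()
results = {}


def slow_search(items):
    """Hypothetical placeholder for a snu-snu run; it only sleeps briefly."""
    time.sleep(0.1)
    return [f"searched: {item}" for item in items]


def worker():
    """Stand-in for a Celery worker: pulls jobs off the queue and runs them."""
    while True:
        job_id, items = task_queue.get()
        results[job_id] = slow_search(items)
        task_queue.task_done()


def handle_form_submission(job_id, items):
    """Stand-in for a Django view: enqueue the job and return straight away."""
    task_queue.put((job_id, items))
    return f"job {job_id} queued"


# The worker runs in the background, so the "view" never waits for the search.
threading.Thread(target=worker, daemon=True).start()
response = handle_form_submission(1, ["kettle", "toaster"])

# Block here only so the sketch can show the finished result.
task_queue.join()
print(response)
print(results[1])
```

The point is that `handle_form_submission` returns before the search has run, which is what lets a user carry on using the site while the slow job happens elsewhere.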

I spent a lot of time trying to modify the Dockerfile for the Django container so that the webdriver would be accessible to the code. Even though I made some progress, this was an appalling way to go about solving the problem; even if I had managed to cobble something together, I would not have received updates from any of the repositories I was pillaging. After failing at this for many hours, I decided to go about it another way. As a lot of Docker containers are based on the stable and minimal Debian Linux distribution, I created a Debian virtual environment and started trying to manually get Selenium working with a headless browser. This only worked when, on top of installing a virtual framebuffer (which simulates outputting to a screen), I also installed display managers and desktop environments. I stopped at this point as these pieces of software have no place on a server.

I was lucky to be dissuaded from the questionable path I set out on, though I didn't feel it at the time. I read more documentation and decided to create a container just for the headless browser, based on an image created by someone else. I had no idea how my Python code in Django would communicate with this new container. Luckily I had help from someone more experienced, who showed me how to set up the container with an image and access a Selenium-controlled browser remotely. Even then, the image I chose didn't work.

After a little more searching I found an image called selenium/standalone-chrome, which is maintained by SeleniumHQ themselves. All this required was that port 4444 be opened on the container so that Django could talk to it. Four lines of markup in a docker-compose file were all it took to arrive at this simple and perfectly encapsulated solution. The trick with coding seems to be to know the four lines that work from the near-infinite number of combinations of lines that don't.
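For anyone curious what talking to that container looks like from Python, here is a rough sketch rather than the actual snu-snu code. It assumes the Selenium 4 Python bindings, that the docker-compose service is named `headless_chrome` (a guess based on the container name in my logs), and that the server accepts the conventional `/wd/hub` path; `remote_url` and `open_remote_browser` are names made up for this example.

```python
def remote_url(host="headless_chrome", port=4444):
    """Build the address of the Selenium server running in the other container.

    The host name is an assumption: inside a docker-compose network, a
    container is usually reachable by its service name.
    """
    return f"http://{host}:{port}/wd/hub"


def open_remote_browser(url=None):
    """Connect to the remote Chrome instead of launching a local browser."""
    # Imported here so the module can be loaded without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    return webdriver.Remote(
        command_executor=url or remote_url(),
        options=options,
    )


# Usage, only meaningful while the selenium/standalone-chrome container is up:
#
#     driver = open_remote_browser()
#     driver.get("https://www.amazon.co.uk")
#     print(driver.title)
#     driver.quit()
```

The nice thing about this arrangement is that the Django and Celery containers need nothing browser-related installed beyond the Selenium Python package; the browser itself lives entirely in its own container.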

The below screenshot shows the server logs for when the Amazon searches specified in the form above were carried out:

As you can see, after about 13 minutes the task had been successfully completed by Celery. If you look at the lines further up, you can see activity from the container headless_chrome_1 and the output from snu-snu being displayed via the container celery_1.

This post may be gibberish to you but I'm feeling optimistic as I have a working prototype for controlling snu-snu via a web interface. My next post will probably deal with my prototype for displaying recommendations scraped from Amazon.