Earlier this month I created the post Previous Content Restored after I had finally got my existing content back online. Since then I’ve not done anything besides read about Jekyll online.
I certainly overcomplicated the process because I worry too much about various things. You can read about why I transferred from Drupal in the post Restoring My Website.
Before I begin I want to add that my laptop is running Windows 8.1. I’ve used versions of Linux on and off for long enough to be comfortable using a command line.
Here’s a short summary of what I did to transfer my website:
- Set up Ubuntu Server 14.04 using a Virtual Machine in VirtualBox
- Hosted my Drupal website in Ubuntu Server using a standard LAMP setup
- I used Ubuntu Server so I could easily set up a LAMP server
- Some minor tweaks were made in Drupal to hide some pages and remove active content
- HTTrack Website Copier used to get an HTML copy of the Drupal website
- PHP DOMDocument used to strip head, JS and other content
- PHP Tidy used to output nicely formatted XHTML
- Python library Beautiful Soup used to transform the HTML further
- Notepad++ & Sublime Text 3 using regex patterns to find and remove some content
- Some manual updates made using a text editor
- phpMyAdmin used to export node data from Drupal
- Python used to convert standard HTML into HTML with Jekyll front matter and additional cleanup
- wget on Ubuntu Desktop used to check for broken internal links
- Final changes made manually with a text editor
Listed like that it seems like a lot of steps. While the goal was to get set up in Jekyll, I used this as an opportunity to play around with a few different methods of manipulating HTML content. I worked on this on and off for a week or so.
Steps 1 to 4 were done on the Ubuntu Server VM. Steps 5 to 7 were done on my laptop. The data extract for step 7 was done in the Ubuntu Server VM but the data was processed on my laptop. Steps 8 & 9 were done on my laptop. Step 10 was on the Ubuntu Desktop VM, though the data was checked on my laptop, and finally step 11 was on my laptop.
At some point during this work I set up an Ubuntu Desktop VM in VirtualBox. That’s what I’m currently using to run and test Jekyll. I could probably have downloaded a version of wget for Windows, but since I had Ubuntu available there was no point.
Since I’m familiar with VirtualBox and Ubuntu, I decided to run Jekyll there rather than risk issues on Windows, which is not officially supported, especially as I’m not familiar with Ruby yet.
Basically this whole process took a copy of the content from the original Drupal website and prepared it for use in Jekyll.
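For anyone curious what "prepared for use in Jekyll" can look like in practice, here is a minimal Python 3 sketch of wrapping cleaned HTML in Jekyll front matter. The function names, the `layout` value and the choice of fields (title and date) are assumptions for illustration only; the node data exported from Drupal may well contain different fields.

```python
import json


def to_jekyll_page(title, date, body_html, layout="post"):
    """Return the page text: YAML front matter followed by the HTML body.

    `title`, `date` (as YYYY-MM-DD) and `layout` are illustrative fields,
    not a fixed list of what a Drupal export contains.
    """
    front_matter = [
        "---",
        f"layout: {layout}",
        # json.dumps gives a double-quoted, escaped string, which also
        # works as a YAML double-quoted scalar (safe for colons, quotes).
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"date: {date}",
        "---",
    ]
    return "\n".join(front_matter) + "\n" + body_html.strip() + "\n"


def post_filename(date, slug):
    """Build a file name in the YYYY-MM-DD-title form Jekyll uses for posts."""
    return f"{date}-{slug}.html"


if __name__ == "__main__":
    page = to_jekyll_page("Restoring My Website", "2014-05-01", "<p>Hello</p>")
    print(post_filename("2014-05-01", "restoring-my-website"))
    print(page)
```

Quoting the title matters more than it looks: a title with a colon in it would otherwise break the front matter.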
Of the different methods I used to process HTML, I think Beautiful Soup with Python was the easiest. It did take me a couple of attempts at some points to figure out the correct options to avoid Unicode issues. I eventually got it set up so all the content (including the Japanese characters) was correctly processed and saved.
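As a rough illustration of the kind of setup that avoids those Unicode issues, here is a small sketch written for Python 3 with the `bs4` package. It assumes the source files are UTF-8 and that `script` and `style` tags are what needs removing; the function names are hypothetical and the options that were actually needed may have been different.

```python
from bs4 import BeautifulSoup


def clean_html(raw_bytes):
    """Parse UTF-8 bytes, drop script/style tags, return the HTML as text."""
    # Pass bytes and state the encoding explicitly instead of letting
    # the parser guess it (assumption: the source files are UTF-8).
    soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding="utf-8")
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()  # remove the tag and everything inside it
    return str(soup)


def clean_file(src_path, dst_path):
    """Read a file as bytes, clean it, and write it back out as UTF-8."""
    with open(src_path, "rb") as src:
        cleaned = clean_html(src.read())
    # Being explicit about the output encoding matters on Windows,
    # where the default text encoding is often not UTF-8.
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(cleaned)


if __name__ == "__main__":
    sample = "<p>日本語</p><script>alert(1)</script>".encode("utf-8")
    print(clean_html(sample))
```

The main idea is to be explicit about the encoding at both ends, when reading and when writing, so that nothing falls back to the platform default.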
Maybe one day I will extract some of the Python code and host that on GitHub.