Recently, I ran into a dilemma with my personal blog, which is hosted by a local blog host provider. Blogging there was great while the community of bloggers was still tight-knit. From time to time, we'd all meet up and share our thoughts in person. The providers were really good at hearing out their users' thoughts and wishes, rants and raves included. They were really good at what they were doing (up to a point).

Here are a couple of my raves about their service and the community:

  • Most of the members were real people. It was not just an avenue for spammers and anonymous bloggers
  • People of that community were intelligent people. It's nice to have some good conversations with them.
  • There's always a superhero in the forum who tries to help (though he's not connected with the providers themselves)
  • Even their higher management mingles with the people of their community
  • Customer support is quick to reply
And some of my rants too:
  • Sometimes, they forget about threads/issues
  • Their customer support knows very little about how the system works
  • Their customer support cannot provide tech support
  • No integration with offline blogging software (though Wordpress supports this)
  • I found out (while writing this script) that they had made a lot of substandard tweaks (especially in the markup)
And yes, it might be obvious to some: I experienced every one of these rants over the past few weeks. I wanted to make an offline backup of my blog (hosted with them), so I got in touch with their customer support. I was courteous, patient, and understanding, since my request might come across as weird or exceptional. The I.ph platform is built on the Wordpress engine, which they tweaked to meet their own requirements as a blog host. I knew for a fact that a backup could easily be provided for special requests like mine. But that was not what happened in my case.

They seemed reluctant, if not outright unwilling, to release a backup of my blog. Why, I don't know. Maybe their tweaks had made the Wordpress engine's one-click backup unusable? I tried to reason with them by clarifying what I needed, but they made me wait in vain. Alas, my patience ran out when I received an email from their customer support saying..

Of course I know my own RSS feed, don't I? But that was not what I had asked for! In a rage, I decided to take matters into my own fiery hands. That email made me furious.. well, it also made me excited and happy! Happy that I could once again write my own code to reach my own goal! Since I was bent on getting my backup the easy way, I looked online to see whether somebody else had already written something similar to what I had in mind. I found this, but it seems to be dead as of the time of writing.

Now, being the Ruby lover that I am, I decided to write a Ruby script that would scrape my content and push it onto my new Wordpress blog. There are a couple of assumptions/caveats for this script:

  • Scrape content only from YOUR OWN BLOG. Please don't use this to steal other people's content. I am not liable for any online content theft from the use of this script.
  • You must be able to understand the structure of the blog you are scraping. You should know where the "excerpt", "main body", "post date", "post author", etc. are located in the markup/source of the blog you are scraping.
  • You must have the credentials of the blog where you want to push the scraped content.
  • If you need something else, or some more tweaking of this script, you must know Ruby, or you can drop me a line and I'll see how I can help you.
  • This script uses atom-tools. There are other gems available, but this is the one I chose to use.
  • Turn on Atom publishing in your Wordpress blog.

Ready?

  • First, of course, have a working Ruby setup.
  • Next, make sure you have the following gems: (1) Hpricot and (2) atom-tools
sudo gem install hpricot
sudo gem install atom-tools
  • Test the following requires in your irb. If it doesn't raise an error, then you're good to go. :)
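require 'rubygems'
require 'open-uri'
require 'net/http'
require 'cgi'   # for CGI::unescapeHTML, used when reading the post title
require 'date'  # for DateTime.parse, used when building the post date
require 'hpricot'
require 'atom/entry'
require 'atom/collection'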
  • Secure all the variables you'll need:
wp_blog_host = "livinglife.sweetperceptions.com"  #the wordpress blog you want to post to
wp_blog_uri = "http://#{wp_blog_host}"  #making it browseable
wp_base = "http://#{wp_blog_host}/wp-app.php"  #appending the base for publishing posts
wp_blog_username = "myusername"  #username to the wordpress blog you want to post to
wp_blog_password = "mypassword"  #password to the wordpress blog you want to post to
your_blog_source = "http://sweetperceptions.i.ph"  #the page you want to scrape content from
which_pages = 1..19  #put the pages here, if the content source will need this, see how I used it
  • You must have registered categories on your recipient wordpress blog to be able to automatically categorize your posts. If your blog source already has categories, then you can use these exact names for the migration. If you only have tags, I can still help you out (as this was my trouble too). You'll need to declare your registered categories as such (these are my own categories):
registered_categories = ["About me", "Artistry", "Cool Finds", "Dreams", "Events", "Health and Beauty", "Horoscope", "Living Life", "Meme", "Movies", "Music", "Notes", "Pet Love", "Quotes", "Random thoughts", "Stories to share", "Techie", "Travel"]
  • If these new registered categories do not match those that you currently have in your blog source, then explicitly setting "synonym categories" would help. For those who did not categorize their posts, and solely rely on their tags, this could be your best avenue for categorizing your posts automatically. Try to categorize which tags (and old categories) would fall into your new categories.
synonym_categories = {
  "About me" => ["me"], 
  "Artistry" => ["poem"],
  "Cool Finds" => ["cool"],
  "Dreams" => ["dream","dreams"],
  "Events" => ["event", "bday", "birthday", "Christmas", "New year", "new-year", "celebration"],
  "Health and Beauty" => ["health", "sickness", "headache", "fever", "cancer"],
  "Horoscope" => ["cookie", "fortune", "horoscope", "astrology", "psych"],
  "Living Life" => ["life", "kalokohan"],
  "Meme" => ["meme"],
  "Movies" => ["hollywood", "movie", "movies", "movie-lines", "happy-feet"],
  "Music" => ["song", "songs", "singer", "music", "ost"],
  "Notes" => ["notes"],
  "Pet Love" => ["pet", "cat", "dog", "animal", "animals", "pets"],
  "Quotes" => ["quote", "quotes"],
  "Random thoughts" => ["thought", "thoughts", "think", "logic"],
  "Stories to share" => ["story", "stories", "adventure"],
  "Techie" => ["tech", "techie", "work", "web2.0", "development", "software", "online", "skype", "pc"],
  "Travel" => ["philippines", "travel", "province"],
}
  • Once everything is set: if you don't have a sitemap of your posts' links, you'll need to collect them by visiting every page of your blog source and extracting the post URLs. This might take some time, so it's better if you already have a list of your posts' URLs. If not, walk through your blog source pages like this (mine had the format http://sweetperceptions.i.ph/page/N):
urls = []  # collected post URLs
which_pages.each do |page|
  from = Hpricot(open(your_blog_source + "/page/#{page}/"))
  # concat keeps urls flat; << would push each page's links as a nested array
  urls.concat((from/"h3[@class='entrytitle']/a").collect{|x| x['href']})
end
  • Or read your links from a sitemap. I used a plain text file with one URL per line, where urls_to_import is the path to that file:
urls = File.readlines(urls_to_import).map { |line| line.chomp }
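
A slightly more defensive way to read that list, as a sketch only. It assumes a plain text file with one URL per line, and the helper name load_urls is my own for illustration, not part of the original script. It skips blank lines and drops duplicates so that a post isn't imported twice.

```ruby
# Hypothetical helper (not in the original script): read post URLs from a
# plain text file, one URL per line.
def load_urls(path)
  File.readlines(path)
      .map { |line| line.strip }       # drop newlines and stray spaces
      .reject { |line| line.empty? }   # skip blank lines
      .uniq                            # avoid importing a post twice
end
```

You would then call it as urls = load_urls(urls_to_import).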

Now, let's take a deep breath. The next steps are more about Hpricot and parsing your own document to extract the necessary information. I'll walk you through how I got mine, but you'll have to construct your own for each section you need. I'll show you the snippet of the section involved, followed by the Hpricot code to get it. I used Firebug to help me with this, by the way.

This is an example of one post in my blog source:

  • Get the title of your post. Each blog entry in my blog source is enclosed in a div which contains an h3 tag with class 'entrytitle'. The actual title is enclosed in a link. You always call the inner_html of the element node you're looking at to get the text.
  title = (CGI::unescapeHTML((doc/"div/h3[@class='entrytitle']/a").inner_html.strip)).gsub(/\r\n/, '')
  • I explicitly set the author to my name. Next, we get the timestamp. Some blogs display the time of each post and others don't. If you want to capture the exact time of the post, it's better to extract it in the script. Mine was displayed at the bottom of the blog entry inside a div with class 'meta-post'. Since that div contains other information too, I used a regular expression to match the format of my timestamp. NOTE: Other blogs will have a different timestamp format. Please adjust as necessary.
  • We still need the date. The date stamp of my post is found inside a div with a span whose class is 'date'.
  datestr = ((doc/"div/span[@class='date']").inner_html.strip).gsub(/\r\n/, '')
  • Once we have the time and the date stamp, we concatenate them to form a string that will be accepted as a valid date via Atom publishing. NOTE: You may need to adjust the timezone offset for this. Mine resolves to -0500.
  datestr = datestr + " " + timestr
  datestr = DateTime.parse(datestr).strftime('%a, %-d %b %Y %T -0500')
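
Since the timestamp extraction isn't shown above, here is a rough sketch of both steps together. It assumes the 'meta-post' text contains a time like "3:45 pm" and that the date looks like "January 5, 2008". Both formats, the sample strings, and the helper names extract_time and atom_date are made up for illustration, so adjust the regular expression to your own markup. It also takes the UTC offset as a parameter instead of hardcoding it in the format string.

```ruby
require 'date'

# Hypothetical helper: pull the first "h:mm am/pm" time out of a blob of text.
# The pattern is an assumption about the timestamp format; adjust as needed.
def extract_time(meta_text)
  match = meta_text.match(/\b\d{1,2}:\d{2}\s*[ap]m\b/i)
  match && match[0]
end

# Hypothetical helper: join date and time, attach the UTC offset, and format
# the result the same way as the strftime call above.
def atom_date(datestr, timestr, offset = "-0500")
  DateTime.parse("#{datestr} #{timestr} #{offset}").strftime('%a, %-d %b %Y %T %z')
end

# Sample input, invented for this example
timestr = extract_time("Posted by me at 3:45 pm | 2 comments")
puts atom_date("January 5, 2008", timestr)
```

Passing the offset into the parse step means the parsed date and the printed offset always agree.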
  • Next comes the tricky part: categorizing your post! If you assembled your registered categories and synonym categories correctly, this will be a breeze. At any rate, you can empty your new blog and redo the import as often as you need or want to. Find where your tags are enclosed. Mine are found inside a div with class 'tag-list'. If a tag matches the exact name of a category, it's automatically added to your filtered tags. Next, your synonym categories are checked to see whether any of your tags map onto one of your new registered categories.
  filtered_tags = []
  tags = (doc/"div[@class='tag-list']/a").collect{|x| x.inner_html}

  # rule 1 -> exact match
  filtered_tags << tags.collect{|x| x if registered_categories.include?(x)}.compact

  # rule 2 -> synonyms
  synonym_categories.keys.each do |syn|
    filtered_tags << tags.collect{|x| syn if (synonym_categories[syn]).include?(x)}.compact
  end

  tags = filtered_tags.flatten.compact.uniq.join(',')
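
The two rules can also be wrapped in one small function, which makes them easy to try out on sample tags before running a full import. This is only a sketch: categorize is a name I made up, and the shortened category lists below are sample data, not the full lists from above.

```ruby
# Hypothetical helper: map a post's tags onto the registered categories.
def categorize(tags, registered_categories, synonym_categories)
  # rule 1 -> exact match: the tag is itself a registered category
  exact = tags.select { |tag| registered_categories.include?(tag) }

  # rule 2 -> synonyms: the tag is listed under a category's synonyms
  by_synonym = synonym_categories.select do |_category, synonyms|
    tags.any? { |tag| synonyms.include?(tag) }
  end.keys

  (exact + by_synonym).uniq
end

# Sample data, shortened for this example
registered = ["Movies", "Music", "Travel"]
synonyms = {
  "Movies" => ["hollywood", "movie"],
  "Music"  => ["song", "ost"],
  "Travel" => ["philippines", "province"]
}

puts categorize(["movie", "ost", "Travel", "unknown"], registered, synonyms).join(',')
```

Tags that match neither rule (like "unknown" here) are simply dropped, which is the same behavior as the snippet above.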

  • Now to get your entire post. Yipee! You need to find a distinguishable id attached to every post. There really should be one unique id for each post. They usually look like 'post-id' or some flavor of that. Mine was enclosed in a div with an id of the form 'postentry-{id}'. You'll need to get this id!
  # Get your contents by finding all paras in the entry post
  entry_id = "postentry-#{doc.at("div[@class='blog']")['id'].split('-').last}"

  # Get the main body content
  contents = (doc/"##{entry_id}")

  • We are almost done. There may be some parts of the body that you'd want to leave out. You can do this by selecting each element via Hpricot and removing it.
  # Remove unneeded elements
  (doc/"##{entry_id}/h3").remove
  (doc/"##{entry_id}/span[@class='date']").remove
  (doc/"##{entry_id}/div[@class='tag-list']").remove
  (doc/"##{entry_id}/div[@class='meta-post']").remove
  • Lastly, assemble your element and post away!
  # Atom Author element
  author = Atom::Author.new
  author.name = "Your Name"  # placeholder: put your own name here
  author.uri = wp_blog_uri

  # Atom Entry element
  entry = Atom::Entry.new
  entry.title = title
  entry.summary = hExcerpt  # the post excerpt (its extraction is not shown in this walkthrough)
  entry.content = contents.inner_html  # HTML of the main body collected above
  entry.content.type = "html"
  entry.published = datestr
  entry.updated = datestr
  entry.tag_with(tags, ',')
  entry.authors << author

  req = Atom::HTTP.new
  req.user = wp_blog_username
  req.pass = wp_blog_password
  req.always_auth = :basic

  # Atom Collection
  c = Atom::Collection.new(wp_base + "/posts", req)

  res = c.post! entry
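
The steps above handle a single post. To run them over every collected URL, one possible driver loop looks like the sketch below. The name import_all is hypothetical, and the block passed to it stands in for the scrape-and-post steps. It records failures instead of stopping, so one bad page doesn't abort the whole migration.

```ruby
# Hypothetical driver: run the given block for each URL and collect failures.
def import_all(urls)
  failed = []
  urls.each do |url|
    begin
      yield url                       # scrape and post one entry here
    rescue StandardError => e
      failed << [url, e.message]      # remember the failure and keep going
    end
  end
  failed
end

# Stand-in block for demonstration; a real run would parse and post each URL.
failed = import_all(["http://example.com/post-1", "http://example.com/post-2"]) do |url|
  raise "could not fetch" if url.end_with?("post-2")
end
failed.each { |url, message| puts "#{url}: #{message}" }
```

After a run, the returned list tells you which posts to retry.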


Did you enjoy this? I know I did the first time! It feels great not having to manually copy and paste my content from the old blog source to the new one. Best of all, I was able to keep my categories. I'm still looking at how to transfer all of my tags to the new blog along with the comments, but I haven't succeeded yet. If you'd like a full copy of this script, you can find it here.

Good luck with your migration! I hope this helped.