Moving from MediaWiki to SharePoint O365 - part 5

So we've pimped SharePoint O365 with some JavaScript and copied all files and images from our MediaWiki to certain libraries in SharePoint. What remains is getting the textual content there. To do this, we're going to combine the formatted (HTML) content, which we rip from our local MediaWiki website using PHP with cURL, and the unformatted wiki-syntax content, which we get from an XML dump.

The XML dump

In your MediaWiki installation, go to the maintenance folder. There you will find a script called dumpBackup.php. Run this script with the parameter --current (we're only interested in the most recent version of each article, not its revision history) and redirect its output to an XML file. I'm working in Ubuntu, so after cd-ing into /var/www/w/maintenance, I can just do:

php dumpBackup.php --current > backup-dump.xml

Then move the file to a newly created folder articles in your wikimigration directory. Create a folder articlesSplit, which is where we're going to store the articles as separate HTML files. In the articles folder, create a new file called extractPages.php and paste this script into it. After updating the root path (on line 2) and the namespace of your MediaWiki installation (on lines 49 - 51), save it and run it:

C:\something\UniServerZ\core\php54\php.exe "C:\something\UniServerZ\www\wikimigration\articles\extractPages.php"

When the script is done, it will tell you how many pages it has processed. This number should be (more or less) the same as the number of pages on your MediaWiki, which you can check by creating an article and putting {{NUMBEROFPAGES}} in it (well, there'll be one more page on your MediaWiki now, i.e. the one you just created).
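
If you want a second opinion on that number, you can also count the page elements in the dump file itself. The following is a minimal Python sketch, not part of extractPages.php. It assumes the dump wraps every article in a page element, and the tiny SAMPLE dump is made up purely for illustration; in practice you would pass the path of backup-dump.xml to count_pages.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_pages(source):
    """Count <page> elements in a MediaWiki XML dump (path or file object)."""
    total = 0
    # iterparse streams the file, so large dumps need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == "page":
            total += 1
            elem.clear()  # drop the children we no longer need
    return total


# Made-up two-page dump, for illustration only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Main Page</title><revision><text>Hello</text></revision></page>
  <page><title>Talk:Main Page</title><revision><text>Hi</text></revision></page>
</mediawiki>"""

print(count_pages(io.BytesIO(SAMPLE)))
```

For the real file this would be called as count_pages("backup-dump.xml").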

For a number of these pages, it doesn't make sense to copy them to SharePoint. E.g. category pages, since categories don't exist there. Or file pages, another thing SharePoint doesn't have. User pages. Talk pages. User talk pages. Pages in the namespace MediaWiki or yourInstallation. These pages are automatically skipped by extractPages.php, i.e. they will not be put in the articlesSplit directory.
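
To give an idea of what such a filter can look like, here is a small Python sketch of skipping pages by namespace prefix. It is an illustration only, not the code in extractPages.php; the function name should_skip is hypothetical, and yourInstallation stands for the project namespace of your own wiki.

```python
# Namespaces listed above as not worth copying to SharePoint.
# "yourInstallation" is a placeholder for your wiki's project namespace.
SKIPPED_NAMESPACES = {
    "Category",
    "File",
    "User",
    "Talk",
    "User talk",
    "MediaWiki",
    "yourInstallation",
}


def should_skip(title):
    """Return True if the page title starts with a skipped namespace prefix."""
    prefix, separator, _rest = title.partition(":")
    # Titles without a colon, or with an unknown prefix, are regular articles.
    return bool(separator) and prefix in SKIPPED_NAMESPACES


for title in ["Main Page", "Category:Physics", "User talk:Alice", "C++: a primer"]:
    print(title, "->", "skip" if should_skip(title) else "keep")
```

Checking the prefix against a fixed set, rather than skipping everything with a colon in it, keeps regular articles whose titles happen to contain a colon.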

Processing the content

All non-skipped articles are now in the articlesSplit directory in the form of separate HTML files. You may want to scroll through them to check that no irrelevant articles have made it through.
What is in these files again? The wiki-syntax content we extracted from the XML dump. Remember: we're going to use this in combination with the online content. So now we need to process each individual page, compare it to the online content, update links, reformat math code, swap out source code, and finally write out the final version we want sent to SharePoint.
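
As an illustration of just one of these steps, updating links, here is a Python sketch that turns plain wiki links such as [[Target]] and [[Target|label]] into HTML anchors. It is not taken from the actual script. The base URL in SHAREPOINT_PAGES and the .aspx page naming are assumptions made up for this example, and the real layout depends on your SharePoint site.

```python
import html
import re
from urllib.parse import quote

# Hypothetical target location; adjust to your own SharePoint site.
SHAREPOINT_PAGES = "https://example.sharepoint.com/sites/wiki/SitePages"

# [[Target]] or [[Target|label]]; nested brackets are deliberately not handled.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def rewrite_links(wikitext, base=SHAREPOINT_PAGES):
    """Replace plain wiki links with HTML anchors pointing at `base`."""

    def replace(match):
        target = match.group(1).strip()
        # Fall back to the target when no label (or an empty one) is given.
        label = (match.group(2) or target).strip()
        href = f"{base}/{quote(target)}.aspx"
        return f'<a href="{html.escape(href)}">{html.escape(label)}</a>'

    return WIKI_LINK.sub(replace, wikitext)


print(rewrite_links("See [[Main Page]] or [[Help:Editing|the editing help]]."))
```

File links, interwiki links and anchors within a page would each need their own handling, which this sketch leaves out.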

I've made a script that does all that. Create the following folders in the articles directory for this script to put its processed pages in:

  • articlesSplit-handled, which is where the original files from articlesSplit will be moved after the script is through with them. 
  • articlesToSend, which is where the processed output will be put for later uploading to SharePoint.
  • templates, with subfolders caller and template.
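
If you'd rather not create these folders by hand, a few lines of Python can do it. This is only a convenience sketch: the folder names come from the list above, while the function name create_folders is made up, and the demo runs against a throwaway temporary directory instead of your real articles directory.

```python
import tempfile
from pathlib import Path

# Folder names from the list above, relative to the articles directory.
FOLDERS = [
    "articlesSplit-handled",
    "articlesToSend",
    "templates/caller",
    "templates/template",
]


def create_folders(articles_dir):
    """Create the output folders under articles_dir and return their paths."""
    created = []
    for name in FOLDERS:
        path = Path(articles_dir) / name
        # parents=True also creates 'templates'; exist_ok=True allows re-running.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demo in a temporary directory; pass your own articles directory instead.
with tempfile.TemporaryDirectory() as tmp:
    for path in create_folders(tmp):
        print(path.relative_to(tmp))
```
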
In the articles directory, create a file doRegularPages.php and copy this script into it. Update the URL of your MediaWiki installation on line 5, the URL of your SharePoint site on line 7 and the path of your project directory on line 8. Then run the script and wait for it to finish (this can take some time):

C:\something\UniServerZ\core\php54\php.exe "C:\something\UniServerZ\www\wikimigration\articles\doRegularPages.php"


Templates

The script will not send processed output for anything to do with templates to the articlesToSend folder, but to the caller and template folders instead. That is because template code is not easy to swap out between the online content and the wiki-syntax content, and because template chaining and special code in templates (e.g. ParserFunctions) are not supported anyway.
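
To make that routing a bit more concrete, here is a rough Python sketch of how a page could be classified. It is a guess at the idea, not the logic of doRegularPages.php. It assumes that the template folder is meant for pages in the Template: namespace and the caller folder for pages that call a template, and it uses a crude heuristic that treats all-caps names such as {{NUMBEROFPAGES}} as magic words rather than template calls.

```python
import re

# Captures the name that follows '{{', up to the first '|' or brace.
TEMPLATE_CALL = re.compile(r"\{\{([^{}|]+)")


def classify(title, wikitext):
    """Return 'template', 'caller' or 'regular' for a page (heuristic)."""
    if title.startswith("Template:"):
        return "template"
    for match in TEMPLATE_CALL.finditer(wikitext):
        name = match.group(1).strip()
        # Simplification: all-caps names are assumed to be magic words.
        if name and not name.isupper():
            return "caller"
    return "regular"


print(classify("Template:Infobox", "{{{title}}}"))
print(classify("Physics", "{{Infobox|topic=Physics}} Some text."))
print(classify("Stats", "We have {{NUMBEROFPAGES}} pages."))
```

A wiki with templates named in all capitals would need a proper list of magic words instead of this shortcut.
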
So for templates and pages making use of templates, you will have to go through these folders manually to create the content to be sent to SharePoint. The script will, however, provide you with both the online content and the wiki-syntax content per page, so you can easily compare the two and combine them as necessary.
Once you're satisfied, copy the resulting files over to the articlesToSend folder, so they can be picked up there together with the other files, and be automatically sent to SharePoint. This last part is easy, and explained in the following post.
