User:Wizzy/CDTools

From Wikipedia, the free encyclopedia

The Wikipedia:Version 1.0 Editorial Team is putting together a CD release. These are my scripts for assembling the CD from a list of articles, taken either from a web page or from MediaWiki markup. See also m:Static version tools. A script that scans an article tree for bad words (usually taken from User:Lupin/badwords) is at User:Wizzy/badwords.

Todo:

  1. Templates (I have not figured these out yet)
  2. Front page (I think this should be a list of major categories?)
  3. Pictures (even when I put these in the right place under Apache, they do not appear in the wpcd articles)

After all this, the uncompressed HTML in the static dump is 91 MB, from the 2496 articles in biglist. A compressed tarball is 23 MB.

If pictures are added and JPEG-compressed, it should fill out nicely to 700 MB, though that is only a guess.
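A quick back-of-the-envelope check of these figures, as a Python sketch. It assumes 1 MB = 1000 KB and a nominal 700 MB CD; the variable names are only for illustration.

```python
# Figures quoted above; the 700 MB CD capacity is a nominal value.
html_mb = 91        # uncompressed static HTML
tarball_mb = 23     # compressed tarball
articles = 2496     # entries in biglist
cd_mb = 700         # nominal CD capacity (assumption)

avg_kb_per_article = html_mb * 1000 / articles
compression_ratio = html_mb / tarball_mb
image_budget_mb = cd_mb - html_mb   # room left if the HTML ships uncompressed

print(f"average article size: {avg_kb_per_article:.1f} KB")
print(f"compression ratio:    {compression_ratio:.1f}x")
print(f"left for pictures:    {image_budget_mb} MB")
```

So the HTML averages roughly 36 KB per article, and about 609 MB would remain for pictures if the HTML is shipped uncompressed.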

Assistance with the items on my Todo list would be appreciated. Finishing them would let other people create CDs of specialised content, such as military history or mathematics, and update the CD from recent XML dumps. I do not yet have a method for selectively excising sections, which User:BozMo needs for his CD.
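One possible approach to excising sections, as a hedged Python sketch. It is not the trim.pl used by the makefile, whose contents are not shown here. The function name `strip_sections` and the idea of passing a list of unwanted heading names are assumptions made for illustration, and the sketch only understands plain `==Heading==` lines in wikitext.

```python
import re

# Matches a wikitext heading line such as "==History==" or "=== Notes ===".
HEADING = re.compile(r"^(={2,6})\s*(.*?)\s*\1\s*$")

def strip_sections(wikitext, unwanted):
    """Return wikitext without the sections whose headings are in `unwanted`.

    A removed section runs from its heading up to the next heading of the
    same or a higher level, so its sub-sections are removed with it.
    """
    unwanted = {name.lower() for name in unwanted}
    kept = []
    skip_level = None  # heading level currently being skipped, if any
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            if skip_level is not None and level <= skip_level:
                skip_level = None      # the skipped section has ended
            if skip_level is None and match.group(2).lower() in unwanted:
                skip_level = level     # start skipping this section
        if skip_level is None:
            kept.append(line)
    return "\n".join(kept)
```

For example, `strip_sections(text, ["See also", "External links"])` would drop those two sections and keep everything else, including the lead.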

Categories

These are the sub-categories of Version 0.5:

{ category = Miscellaneous | Arts | Langlit | Philrelig | Everydaylife | Socsci | Geography | History | Engtech | Math | Natsci }

They should form part of the front page.
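A minimal sketch of how a front page could be generated from that category list, in Python. The link layout (`categories/<name>.html`) and the page title are hypothetical; they are not the paths that dumpHTML.php writes and would need adjusting to the real static tree.

```python
from html import escape

# Sub-category names as listed above.
CATEGORIES = ["Miscellaneous", "Arts", "Langlit", "Philrelig", "Everydaylife",
              "Socsci", "Geography", "History", "Engtech", "Math", "Natsci"]

def front_page(categories, link_pattern="categories/{name}.html"):
    """Build a minimal HTML front page with one link per category.

    link_pattern is a hypothetical layout, not the one dumpHTML.php writes.
    """
    items = "\n".join(
        '  <li><a href="{href}">{text}</a></li>'.format(
            href=escape(link_pattern.format(name=name), quote=True),
            text=escape(name))
        for name in categories)
    return ("<html><head><title>Wikipedia CD</title></head><body>\n"
            "<h1>Categories</h1>\n<ul>\n" + items + "\n</ul>\n</body></html>\n")

if __name__ == "__main__":
    print(front_page(CATEGORIES))
```

Redirecting the output to an index.html at the base of the CD image would give a starting page that can be restyled later.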

Search

I have been looking at ksearch-client-side, a JavaScript search engine that runs in the browser. The JavaScript program holds the search database itself: one-line summaries of the articles, and an inverted tree matching words back to articles. With some tweaking (cutting all articles to 3K, so that it searches the lead paragraph only) I have reduced the JavaScript to a 'mere' 2.8 MB. That is still a bit big, but search is great. It works fine in Opera but has some problems in Firefox, which are fixable with a hack.
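To illustrate the idea of matching words back to articles, here is a small Python sketch of a dictionary-based inverted index built from the first 3K of each article. It is an assumption about the general technique, not a description of the format ksearch-client-side actually uses; the function names and the 3 * 1024 character limit (characters rather than bytes) are illustrative.

```python
import re
from collections import defaultdict

LEAD_LIMIT = 3 * 1024  # index only the first 3K characters of each article

def build_index(articles):
    """Map each lower-cased word to the set of article titles containing it.

    `articles` is a dict of title -> plain text; only the first LEAD_LIMIT
    characters of each text are indexed.
    """
    index = defaultdict(set)
    for title, text in articles.items():
        for word in re.findall(r"[a-z0-9]+", text[:LEAD_LIMIT].lower()):
            index[word].add(title)
    return index

def search(index, query):
    """Return the sorted titles that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

Truncating before indexing is what keeps the index small: words that appear only after the cut-off never enter it, so they cannot be found by a search either.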

If you want to try out the search on BozMo's CD, unpack ftp://ftp.wizzy.com/pub/wizzy/CDTools/BozMo-ksearch-client-side.tar (about 900 KB) at the base of the CD image and open ksearch-client-side/search.html. Wizzy 19:30, 14 October 2006 (UTC)

Build your own CD from a list of articles

This makefile drives the process and gets you from one step to the next. The targets are vaguely ordered from top to bottom but have no hard dependencies between them, because some of the steps take a long time.

Sometimes there is a little Perl script that I can't put inline because make thinks it owns do$$ar signs (in a makefile a literal $ has to be written as $$).

ftp://ftp.wizzy.com:/pub/wizzy/CDTools has this makefile and the other files.

If you figure out a better way of doing the pictures, please send it back!

Makefile


###################################################################################################
# Makefile
# NB: each command line under a target must start with a tab character.
# http://www.mediawiki.org/wiki/MWDumper

# pick article titles out of a page of Talk: page URLs
#! /usr/bin/perl -lane 'for $article (/title="Talk:([^"]+)/g) { print $article; } # file:cat-wpcd.pl'
wpcd:
        ./cat-wpcd.pl wpcd-* | sort -u > biglist
        java -jar /export/home/andyr/bin/mwdumper.jar --filter=exactlist:biglist enwiki-20060518-pages-articles.xml.bz2 | bzip2 > wpcd.xml.bz2

wpcd2:
# pick out of square brackets
#! /usr/bin/perl -lane 'for $article (/\[\[(.+?)\]\]/g) { print $article; }'
        ./cat-wpcd2.pl wpcd2.list | sort -u > biglist
        java -jar /export/home/andyr/bin/mwdumper.jar --filter=exactlist:biglist enwiki-20060518-pages-articles.xml.bz2 | bzip2 > wpcd.xml.bz2

# take out ==sections==

trim:
        bzcat wpcd.xml.bz2 | ./trim.pl | bzip2 > wpcd-trim.xml.bz2

# java options: -Xss96k -Xmx128m
# drop whole database, rebuild from scratch
# cats.pl looks through articles for popular categories, and filters the category dump accordingly

database:
        echo 'drop database wikidb; create database wikidb;' | mysql -u root -ppassword
        echo 'use wikidb;' | cat - /usr/local/src/mediawiki-1.6.7/maintenance/tables.sql | mysql -u root -ppassword wikidb
        java -jar /export/home/andyr/bin/mwdumper.jar --format=sql:1.5 wpcd-trim.xml.bz2 | mysql -u root -ppassword wikidb
        zcat enwiki-20060518-categorylinks.sql.gz | ./cats.pl | mysql -u root -ppassword wikidb

#sql:
#       java -jar /export/home/andyr/bin/mwdumper.jar --format=sql:1.5 wpcd.xml.bz2 > sqldump

# dumpHTML.php options:
# * -d <dest>      destination directory
# * -s <start>     start ID
# * -e <end>       end ID
# * --images       only do image description pages
# * --categories   only do category pages
# * --redirects    only do redirects
# * --special      only do miscellaneous stuff
# * --force-copy   copy commons instead of symlink, needed for Wikimedia
# * --interlang    allow interlanguage links

Static:
        php-cgi /usr/local/src/mediawiki-1.6.7/maintenance/dumpHTML.php -d ${HOME}/wpcd/static

# look through all html files, fix up picture links, pull out thumbnail size, thumb it
# rummage in a flat directory of files, and 'convert' to web directory.
images:
        find static/? -name \*.html -print0 | xargs -0 ./images.pl

# build plucker database via webserver
pdb:
        plucker-build --bpp=8 --maxwidth=160 --stayonhost -f wpcd2 --maxdepth=4 http://localhost/static/index.html
######################################################################################
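For readers who prefer Python to Perl, the two title-extraction one-liners quoted in the makefile comments can be sketched as follows. The regular expressions are copied from those comments; the function names are hypothetical, and `biglist` only approximates `sort -u`, whose ordering depends on the locale.

```python
import re

# Same patterns as the Perl one-liners in the makefile comments.
TALK_TITLE = re.compile(r'title="Talk:([^"]+)')   # cat-wpcd.pl
WIKI_LINK = re.compile(r"\[\[(.+?)\]\]")          # cat-wpcd2.pl

def titles_from_talk_links(html):
    """Article titles taken from title="Talk:..." attributes, line by line."""
    return [t for line in html.splitlines() for t in TALK_TITLE.findall(line)]

def titles_from_wiki_markup(markup):
    """Article titles taken from [[...]] links in MediaWiki markup."""
    return [t for line in markup.splitlines() for t in WIKI_LINK.findall(line)]

def biglist(titles):
    """Sorted, de-duplicated titles, roughly like piping through `sort -u`."""
    return sorted(set(titles))
```

As with the Perl pattern, a piped link such as [[Mathematics|maths]] is captured whole, including the part after the pipe, so such entries may need cleaning before they are handed to mwdumper's exactlist filter.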