Wiki Weekly Backup ZIP is really bad.

GhostlyDeath
Posts: 6
Joined: Tue Jun 05, 2012 8:21 pm

Wiki Weekly Backup ZIP is really bad.

Post by GhostlyDeath »

That is, http://files.osdev.org/osdev_wiki.zip. The problems:
  • The dumped files lack the .htm extension.
  • The saved filenames are not safe for a FAT filesystem, so anything with ":" in it becomes TW1FI4~2, TVNKVB~3, TUV1H8~V, etc.
  • The pages are fine in that they use relative links to other pages; however, they reference talk pages directly, for example as "Talk:UEFI", and to a browser that is a URI with the "talk" scheme and "UEFI" as the scheme-specific part. Think of "mailto:[email protected]" (see the sketch after this list).
  • No User or File pages (file pages describe a file or image).
  • No files that are not images (if any actually exist).
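To illustrate the "Talk:UEFI" point: a local page containing a link like href="Talk:UEFI" is parsed by the browser as the unknown "talk" URI scheme, so the link goes nowhere offline. A rough sketch of one way to patch such links after downloading (the Talk_*.htm naming here is only an assumption for illustration; the real dump would need matching renames):

Code:

#!/bin/sh
# Rewrite href="Talk:Page" into a relative path so browsers stop treating
# "Talk" as a URI scheme. Purely illustrative; adjust to the actual file names.
find . -type f -name '*.htm*' | while read f
do
	sed 's/href="Talk:\([^"]*\)"/href=".\/Talk_\1.htm"/g' < "$f" > "$f.tmp"
	mv "$f.tmp" "$f"
done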
Some years back, the copy of the Wiki appeared to be done nicely: it had all the image pages and the user pages, no illegal filenames, and everything had a proper extension.

EDIT: I am currently writing a script that should take care of all of these issues.

EDIT: Created the script. I would suggest running it in a safe place, as a user that cannot destroy anything, to be on the safe side.

Code:

begin-base64 644 getwiki.tgz
H4sIAAvdAFQAA+1XbXPbNhLOV+lXbBi5tlORlNw4d20tTz12nHYucTqpO/4Q
5TwQCYmISEIHgJbV6/33exaQbOXFvU+9m5vhfrBIEPuC3efZhWfSLdVcJbZ4
9KfJAPKXw0P/C/n09/k3zw4fDQ+eDQbDw+cHz54/Ggzx8PwRDf68kO6lsU4Y
okdGa/dH+/7T9/9TefI4nag6tUX3CZ3pZV1qkVtyhSSGBYk6p0rMpSXlqFaZ
pKk2VOpMlNRYMZN9cv5zqZ0lPSWnKklO6wT2rrSZWxLQrZ2sc5mTrulMTpSo
6aqQ8rcVbKh65t2hDHUuTE4Lkc1h2LKFn4VxSpTliibCsv50SoIyXVUc2FQ3
tbe5Cdc7NcrBG01W9LLQ1pWrMylcQUez8Jbz2w9GVlrniTaz4y6UThtjZO18
9N2zk8sXo6i3lwsn6eud8/2It7zGMSlXRmZOmxWOSIvG0VSVnJu6e3U2irTN
5Q3Hcd1jG1HXVBSbKcUxRb2rs6hbzWHg/hVWX8q1DTz/RCLnJFXaIB0LmeHk
lMuFK2yfJjITjZW00g3NoGR1JV3BySvVXEI7SdcqO98IBInTZIWokcdS1XOZ
px90UU8aM5MGG3IcP5W3MktRGpvmctLMksJVZTfL7+NbsqM4rpQxKHo8p3iB
VyM/IAe0+3T0tP9LcPnd0118kLdZ2eQy3iRJSTvavdtxvzf96PnTt8/fv7Ty
5bUUcYy7cYxEmRgYqt26KOsz9KmWS49pVS2MvpF5EiHwUlXKxQblHg0P/jqH
id6ezApN0ZmmWjvKjGQsMIC2ILCnEpmQvBXVopQJQBnRMSGXN2ndlOU+R1L/
uGXspFyKlSXJ7mtCtlPO+OdKL7Z0zrUB565UneulZeAvhFOTUnrQUC0qaT83
gBJZZ1TmYt4V+12jUi+lycCi/jJY69c607UzuuwLmym1HWnmGmDPd4Bf3776
zEPh3OK7NPWU8/llIqWvhaqvf0baGdi/AJ80MfAjjaU9BimdI3VTfbvvAcxn
KaWT4DbCmDZM72UhPZmBf2B4Ta41v3Nl53gQjgpxc8cPeLKraqJL7jNWY8mx
HQXCQ6lKEAehf/jOhPJJYjYwYenHy9ev0GtAEKowA0AwMnJRikz68sgbiRJz
d1onG568WUEzNRO1UxlxyDYzauG48Vw2pqYdbgw7B4d3fPURKwvsLLmF4smf
VIVD+SAyncvE94oppa5apL1egj3dKepECcVutUAI9DvNECDt7rwbxN+KeHoS
n7/fetzFBra8a9O/j5Nxmqbf23QnRSzpjL8tC0YMgJxzDmQ3193OE7qQUEHI
oVf4mHSZe2D1eZ0XmDO8wPsvtzYgsVYzpSRPDI0RYIOFxFMspb11YS0Htu/V
fY6QHmy/vr7mHwsqlmq6Cr4XEjRUKPlW1ZNu582rM+7IAZ09jj+6P+54712S
vh/vp+PxeIjDol93Ll5c/aHC1n5OE5KEcIJutxO0bNqD27QHU+kMFDj+qDi5
rj3OX3KOjG5mhcdZgKyYgQp9nwV2yUikPV0D6ks5CaDbZ8QUgEMYn5iqjLjH
9Eo5pAAT7zGdOJ8CP1E9frF7iem2nphrkgwPDwbBJCz61jYlKbICGZRr40NE
gVLkHEMwI5yu8IdKKazbZ2s1o3OpG9SWlUhMdONgEbWqVN24EGPQZrys2Zt0
L8450Q9gdcxD5V35fvzPQX84/lfPAzGjuESeT09H0SB6COVf1PwChNkK6ny7
MNQ7PaWvacjlDxVMEvKlpz3+hBT2LjDMaXj81UG3w4UB4aKtokZ0dAeWu2p3
O9XN5nnzNVR/MyJ4u7f5ABwCENYMC31lZ00AaKApovvg9sGM2hqdHycmv6f/
w2T+m8QG74chcuejc/rr29Em8k5QdaYBn1mtc/lmiynYukWUHc+KQArY/2nq
vaEB2zAW/ZHyfsAP5yScA96xHzh8B4uXbyIabSy/xzpvxk9ngvDneJiqYP2t
DE2G8x3fbDS8ASyGM/hnn/wN/y7QV7czzowJF6qH2udD+fsoD590jJ379nAX
XtgSYtpEcxouCrjfwrm85SFfsjPQ8QMPmZmWnj4V+oMnLTF0Bd+Nr3ntmtf8
RWw/SZKAsN3uES8c4wfx4scpV8rjtzKABck+SsNS9wh3QkE8nWP5j0bdjCIM
XFwGiohnrPP3ocH31Jhy9IDbCEbStZ+Jzlf4gUFYGUUPajzw4SgVbGxtJQ1n
2AWxfGrCbRMpO6lXuEM5g7na1HMegxuybbfbQK0J/jUg9CVPKGGyQuEmoBxf
WpGt3xSup7infuuvrwleo/VF9n/9T1YrrbTSSiuttNJKK6200korrbTSSiut
tNJKK6200sp/Uf4NtLOntAAoAAA=
====
bluemoon
Member
Posts: 1761
Joined: Wed Dec 01, 2010 3:41 am
Location: Hong Kong

Re: Wiki Weekly Backup ZIP is really bad.

Post by bluemoon »

I had some spare time, so I extracted the attached script.
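(For anyone else who wants to do the same: save the base64 block from the first post to a file and decode it. Something like the following should work, assuming uudecode from sharutils is installed and the block was saved as attachment.txt:)

Code:

# Decode the "begin-base64 ... ====" block into getwiki.tgz and unpack it
uudecode attachment.txt
tar xzf getwiki.tgz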

I have not looked too deeply into the script, but I have a suggestion: instead of removing the directory if it already exists, I would just quit if it exists and assume it is already up to date - not everyone likes rm -rf.
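Something like this in place of the rm -rf/mkdir lines near the top, just to illustrate the idea (untested):

Code:

# Bail out if a mirror directory for today already exists instead of wiping it
if [ -d "$WD" ]
then
	echo "$WD already exists, assuming it is already up to date" 1>&2
	exit 0
fi
mkdir -- "$WD"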

As for the OP's script, I suggest running it inside a container, a jail or something similar - RUN AT YOUR OWN RISK.
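For example, a minimal sketch of running it as a throwaway user (the user name wikimirror and the file name getwiki.sh are placeholders; adjust for your system):

Code:

#!/bin/sh
# Run the mirror script as an unprivileged throwaway user so a stray rm/mv
# cannot touch anything that matters. Assumes sudo and useradd are available.
sudo useradd -m wikimirror
sudo cp getwiki.sh /home/wikimirror/getwiki.sh
sudo -u wikimirror sh -c 'cd /home/wikimirror && sh getwiki.sh'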

Code:

#!/bin/sh
# Downloads the wiki and makes it nice for local usage, takes lots of time too.
# Works as intended on Debian Wheezy using the standard packages.
# Partially based off a command found on the wiki.
# Written by GhostlyDeath <[email protected]>

# Current time
DATE="$(date +%F)"

# Make directory to put files in
WD="osdevwiki_$DATE"
rm -rf -- "$WD"
mkdir -- "$WD"

# Get files
# I added more special depths, because you get something like
# ./special%3arecentchangeslinked/johnburger%3ademo/exec/ints/debug.html
cd -- "$WD"
wget --mirror -k -p --reject '*=*,Special:*' --exclude-directories='Special:*,Special:*/*,Special:*/*/*,Special:*/*/*/*,Special:*/*/*/*/*,Special:*/*/*/*/*/*,Special:*/*/*/*/*/*/*' \
--user-agent="osdev-mirror, new and improved." --limit-rate=128k \
$(echo "Do not create host directory (i.e. example.com" > /dev/null) \
-nH \
$(echo "Always end in htm/html" > /dev/null) \
-E \
$(echo "Force Windows compatible file names" > /dev/null) \
--restrict-file-names=lowercase,windows,nocontrol,ascii \
$(echo "Actual wiki URL" > /dev/null) \
http://wiki.osdev.org/Main_Page
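
# (The $(echo "..." > /dev/null) constructs above are just no-op inline
#  comments for the individual wget options; they expand to nothing.)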

# Some browsers (like Firefox) get completely confused when there are files on the disk that have special
# symbols associated with them. So all of those links in HTML pages must be replaced in every single file
# with a gigantic sed script.
# Turn % to %25 because that is how it is used in the HTML code.
rm -f /tmp/$$.sed
touch /tmp/$$.sed  # make sure the sed script exists even if nothing needs renaming
find . -type f | grep '%[0-9a-fA-F][0-9a-fA-F]' | sed 's/^\.\///;s/%/%25/g' | while read line
do
	# Need to change the old name to the new name.
	# The old name also needs its . and / escaped (unescaped they confuse sed).
	# Turn %25 into ___ to simplify the operation on the disk.
	OLD="$(echo "$line" | sed 's/\([./]\)/\\\1/g')"
	NEW="$(echo "$line" | sed 's/\([/]\)/\\\1/g;s/%25/___/g')"

	echo "s/$OLD/$NEW/g" >> /tmp/$$.sed
done

# Go through all files again, and sed them (only web pages)
# This takes forever! Literally! At the time of this writing there are 1520 pages
# and if each one takes 1 second (on this atom at least) then it would take about
# 25 minutes for this to complete.
NF="$(find . -type f | grep '\.htm[l]\{0,1\}$' | wc -l)"
CC="0"
find . -type f | grep '\.htm[l]\{0,1\}$' | while read line
do
	CC="$(expr $CC + 1)"
	echo ".. $line ($CC of $NF)" 1>&2
	sed -f "/tmp/$$.sed" < "$line" > /tmp/$$
	mv /tmp/$$ "$line"
done
echo "Done" 1>&2

# Go through all files, and change every % to ___
# First rename directories
find . -type d | grep '%' | while read line
do
	# Keep changing % to ___
	CUR="$line"
	while true
	do
		TO="$(echo "$CUR" | sed 's/%/___/')"

		# If line has not changed, then done renaming
		if [ "$TO" = "$CUR" ]
		then
			break
		fi

		# Rename
		mv -v "$CUR" "$TO"
		CUR="$TO"
	done
done

# Now through all the files
find . -type f | grep '%' | while read line
do
	TO="$(echo "$line" | sed 's/%/___/g')"
	mv -v "$line" "$TO"
done

# Create an index html which just goes to main page (expanded_main_page.html)...
echo '
<html>
<head>
<title>Redirecting</title>
<meta http-equiv="refresh" content="0; url=expanded_main_page.html">
</head>
<body>
<a href="expanded_main_page.html">expanded_main_page.html</a>
</body>
</html>
' > index.html

# Any extra gunk
rm -f /tmp/$$ /tmp/$$.sed

# Go back out and archive it
cd ..
zip -r -9 "$WD.zip" "$WD"