To prepare the migration from Bitbucket, I started to play a bit with its API to see what could be done. So far I have quickly drafted two (ugly) Python scripts to archive the forks and the pull requests. Since this is a one-shot for us, I did not care about robustness, safety, generality, beauty, etc.
** Forks **
The hg clones (history+checkout) represent 20 GB, maybe 12 GB if we remove the checkouts. Among the 460 forks, 214 seem to have no changes at all (according to "hg out") and could be dropped. I don't know yet where to host the rest, though.
This script can be run incrementally.
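The fork triage above boils down to running "hg out" in each clone and keeping only the forks that have outgoing changesets. A minimal sketch of that idea (the `forks/` layout and the upstream URL are placeholders, not the actual script):

```python
import subprocess
from pathlib import Path

def has_outgoing(returncode: int) -> bool:
    """hg outgoing exits 0 when there are outgoing changesets, 1 when there are none."""
    return returncode == 0

def fork_has_changes(fork_path: Path, upstream: str) -> bool:
    # "hg out -q" compares the local clone against the upstream repository.
    proc = subprocess.run(
        ["hg", "out", "-q", upstream],
        cwd=fork_path,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return has_outgoing(proc.returncode)

if __name__ == "__main__":
    # Hypothetical layout: one hg clone per fork under ./forks/
    forks_dir = Path("forks")
    for fork in (sorted(forks_dir.iterdir()) if forks_dir.is_dir() else []):
        status = "keep" if fork_has_changes(fork, "https://bitbucket.org/example/main") else "drop"
        print(f"{fork.name}: {status}")
```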
** Pull-Requests **
Currently this script cannot be run incrementally. You have to run it just before closing the corresponding repository!
Also, this script does not grab inline comments; only the main discussion is archived. Inline comments could be obtained by iterating over the "activity" pages, but I don't think that's worth the effort, because they would be difficult to exploit anyway.
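For reference, walking the pull requests comes down to following Bitbucket's 2.0 REST pagination: each JSON page has a "values" list and a "next" URL. A sketch with the fetch function injected so the pagination logic can be exercised offline (workspace/repo names are placeholders):

```python
import json
import urllib.request

def fetch_json(url: str) -> dict:
    """Fetch one page of a Bitbucket 2.0 API endpoint as JSON."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def iter_pages(url: str, fetch=fetch_json):
    """Yield every item from a paginated Bitbucket-style endpoint."""
    while url:
        page = fetch(url)
        yield from page.get("values", [])
        url = page.get("next")  # absent on the last page

def archive_pull_requests(workspace: str, repo: str, fetch=fetch_json):
    # Without state filters the API only returns open PRs, so ask for all states.
    url = (f"https://api.bitbucket.org/2.0/repositories/{workspace}/{repo}"
           f"/pullrequests?state=MERGED&state=OPEN&state=DECLINED&state=SUPERSEDED")
    return list(iter_pages(url, fetch))
```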
** hg to git **
As discussed in the other thread, if we switch from hg to git, then all hashes will have to be updated. Generating a map file is easy, and thus updating the links/hashes in bug comments and PR comments should not be too difficult (we only have to figure out the right regex to catch all variants).
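A minimal sketch of that comment-rewriting step, assuming a map from full hg hashes to git hashes: replace every hash-looking token whose prefix matches a known hg hash. The regex and the prefix handling are exactly the parts that would need tuning to "catch all variants".

```python
import re

# Matches abbreviated (7+) up to full (40) hex hashes.
HASH_RE = re.compile(r"\b[0-9a-f]{7,40}\b")

def rewrite_hashes(text: str, hg_to_git: dict) -> str:
    # Index full hashes by their first 7 characters so abbreviated
    # references in comments still resolve (assumes no prefix collisions).
    by_prefix = {h[:7]: g for h, g in hg_to_git.items()}

    def repl(m):
        git = by_prefix.get(m.group(0)[:7])
        if git is None:
            return m.group(0)  # unknown hash: leave it alone
        # Keep the abbreviation length the comment originally used.
        return git[:len(m.group(0))]

    return HASH_RE.sub(repl, text)
```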
However, updating the hashes within the commit messages will require rewriting the whole history in a careful order. Does anyone here feel brave enough to write such a script? If not, I guess we could live with an online PHP script doing the hash conversion on demand. I don't think we will have to follow such hashes very often.
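The "careful order" is topological: a commit's message can only be patched once every commit it references has already been rewritten and received its new hash. A toy model of that loop (a real run would drive a history-rewriting tool such as git filter-repo; the hash derivation here only mimics how a commit id depends on its rewritten content):

```python
import hashlib

def rewrite_history(commits):
    """commits: list of (old_hash, parent_old_hashes, message), already in
    topological order (parents before children)."""
    old_to_new = {}
    rewritten = []
    for old, parents, msg in commits:
        # Patch references to already-rewritten commits in the message.
        for o, n in old_to_new.items():
            msg = msg.replace(o, n)
        new_parents = [old_to_new[p] for p in parents]
        # New id depends on rewritten message and parents, like a git commit.
        new = hashlib.sha1("\n".join([msg] + new_parents).encode()).hexdigest()
        old_to_new[old] = new
        rewritten.append((new, new_parents, msg))
    return old_to_new, rewritten
```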