Conversation

I want to process 4714 Message-IDs to collect some patch stats for the last year. I could collect with which uses the archive, but I wonder: does it cache, or will I end up with lots of redundant mbx files? Worse still, would I just be hammering an overloaded server? I guess I should check out the docs.

@stsquad @monsieuricon would know for sure. IIUC there's some caching, but you can also just clone the mailing list locally (public-inbox uses git as a storage backend).
@stsquad I'd need to know what kind of info you're looking for from those 4714 messages before I can really give you any advice. The kind of caching b4 does is unlikely to be useful in this task.

@stsquad Why don't you just download the 12 monthly mboxes? (Well, qemu has them; not sure about everything else.)

@monsieuricon for every Message-Id in the commits for the last year I want to fetch the thread they are from and analyse the patch rev count and meta-data in the commit message.
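
For illustration, gathering the Message-Ids from commit messages might look like the minimal sketch below. It assumes each commit carries a trailer of the form `Message-Id: <id@host>` on its own line and that the log text comes from something like `git log --since="1 year ago"`; the name `extract_msgids` is hypothetical and not part of any tool mentioned in this thread.

```python
import re

# Assumed trailer format: "Message-Id: <id@host>" on its own line
# (matched case-insensitively, leading indentation allowed).
MSGID_RE = re.compile(
    r"^\s*Message-Id:\s*<([^>\s]+)>\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_msgids(log_text):
    """Return the unique Message-Ids found in log_text, in order of first appearance."""
    seen = set()
    ordered = []
    for msgid in MSGID_RE.findall(log_text):
        if msgid not in seen:
            seen.add(msgid)
            ordered.append(msgid)
    return ordered
```

Dropping duplicates at this stage already trims the list before any thread is fetched.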

@stsquad Yes, you can do that with `b4 mbox [msgid]` to fetch whole threads, but you will indeed have redundant mbx files unless you first analyze each mbx file and filter out msgid's which you already have.
@stsquad In other words, grab the first msgid in your list, analyze the mbx file to get all retrieved msgid's and then compare each next msgid you try to retrieve against your set of "already have" messages. That will reduce the number of queries and the number of redundant local mbx files you'll end up with.
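
The "already have" filtering described above could be sketched roughly as follows, using Python's standard `mailbox` module. This is only an illustration under assumptions: `fetch` is a hypothetical callable that retrieves the thread for one msgid (for example by running `b4 mbox [msgid]`) and returns the path of the resulting mbx file, and `msgids_in_mbox` and `fetch_threads` are made-up names.

```python
import mailbox


def msgids_in_mbox(path):
    """Return the set of Message-IDs (angle brackets stripped) in an mbox file."""
    ids = set()
    box = mailbox.mbox(path, create=False)
    try:
        for msg in box:
            raw = msg.get("Message-ID")
            if raw:
                ids.add(str(raw).strip().strip("<>"))
    finally:
        box.close()
    return ids


def fetch_threads(msgids, fetch):
    """Fetch one thread per msgid, skipping msgids already seen in a fetched thread.

    fetch is a hypothetical callable: it takes a msgid and returns the path
    of the mbx file holding that message's whole thread.
    """
    have = set()       # every msgid seen so far in any fetched thread
    fetched = []       # paths of the mbx files actually retrieved
    for msgid in msgids:
        if msgid in have:
            continue   # this message arrived with an earlier thread
        path = fetch(msgid)
        fetched.append(path)
        have |= msgids_in_mbox(path)
        have.add(msgid)  # avoid re-querying even if the thread lacked it
    return fetched
```

Since a patch series usually shares one thread, most of the msgids for a multi-patch series should be skipped after the first fetch.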

@monsieuricon it sounds like I should just use public-inbox and to batch the queries and sync into a de-duped Maildir archive. Isn't it better suited to this sort of archive delving?

@stsquad sure, lei will do what you need here -- you probably want to use the latest master version, because it has a lot of maildir dedupe improvements.