
Estimating storage space reqs for billions of headers

PostPosted: Thu Mar 01, 2018 8:35 pm
by br0adband
Ok so, I've never had any issues with Newsbin over the many years I've been using it. It's one of the best purchases I've ever made (and I've been using computers since the mid-1970s) and an amazing piece of work, so thanks, Quade. ;)

Anyway, I'm considering grabbing a few billion headers from several groups, and I'm wondering if it's possible to get even a rough estimate of how much storage space that might require. One group has just under 6 billion. I know what I'm doing is kinda ridiculous nowadays with so many NZB search engines out there, but I've always had fun just loading a crapload of headers and then browsing for stuff the good old-fashioned way. I've discovered a lot of things that way that I would never have found otherwise, because it never crossed my mind to search for those topics.

Is there any way to get a rough estimate of the necessary space for that many headers from a Usenet newsgroup? Or has anyone else around here who uses Newsbin ever let it grab quite so many?

Thanks again for the great app and the support over the many years.

Have fun, always...
bb

Re: Estimating storage space reqs for billions of headers

PostPosted: Thu Mar 01, 2018 9:21 pm
by itimpi
I seem to remember it being something like 300 bytes per header, which would mean you'd need something like 1.8 TB for that many headers. Even if I'm off slightly, that gives you an idea of the scale!
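A quick back-of-the-envelope check of that figure (the 300 bytes/header is just my recollection, so treat it as a rough assumption, not a measured number):

```python
# Rough sizing estimate for header storage.
# Assumption: ~300 bytes per header (recollection, not measured).
bytes_per_header = 300
header_count = 6_000_000_000  # just under 6 billion in the biggest group

total_bytes = bytes_per_header * header_count
total_tb = total_bytes / 1e12  # decimal terabytes

print(f"~{total_tb:.1f} TB")  # -> ~1.8 TB
```

Scale the two inputs to taste; the point is just that billions of headers lands in terabyte territory, not gigabytes.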

Re: Estimating storage space reqs for billions of headers

PostPosted: Thu Mar 01, 2018 9:43 pm
by br0adband
I guess the main reason I asked is that I don't know whether Newsbin Pro does any compression of the database files. I'm guessing there's some going on, because in the past I moved the data files during an OS re-installation and tried to compress them with WinRAR, and the resulting archive wasn't much smaller.

But if it's ~300 bytes per header then yeah, it'll be a huge amount of space, and I don't have that kind of raw storage at present, so there goes that idea I suppose. My laptop has a 250GB drive in it, and then I have a 640GB drive attached to the eSATA port on the dock.

Thanks for the response.

Re: Estimating storage space reqs for billions of headers

PostPosted: Thu Mar 01, 2018 11:40 pm
by Quade
The headers that end up in the DB3 are compressed. Depending on what version you're using, the headers downloaded but not yet put into the database might or might not be compressed.

It's a significant amount of space either way. I think with 100 gigs free you'd have more than enough. My whole data folder is something like 130 gigs.

Re: Estimating storage space reqs for billions of headers

PostPosted: Fri Mar 02, 2018 11:20 pm
by br0adband
Then I might just have a go at it. Is the header size basically uniform? I know some subject fields are longer than others, so I presume there's some play in the sizes. I was going to grab 1,000 days of headers - I use Newsdemon, who's claiming 3486+ days of retention now - and when that's done I was hoping to extrapolate. Say it turns out to be 25GB, for example: a full pull would then be close to 90GB (25GB x 3 for 3000 days, plus roughly 12.5GB for the remaining 486 days, for a grand total of 87-90GB).
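Here's the extrapolation I have in mind, sketched out (the 25GB-per-1,000-days figure is a made-up example, not a measurement, and this assumes posting volume is roughly even across the retention window):

```python
# Linear extrapolation from a sample header pull to full retention.
# Assumption: storage grows roughly linearly with days of headers.
sample_days = 1000
sample_gb = 25.0          # hypothetical size of the 1,000-day sample
retention_days = 3486     # Newsdemon's claimed retention

gb_per_day = sample_gb / sample_days
full_pull_gb = gb_per_day * retention_days

print(f"~{full_pull_gb:.0f} GB")  # -> ~87 GB
```

Obviously if the sample comes in at some other size, the same ratio applies.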

Am I just thinking too much about this or what? :D

Re: Estimating storage space reqs for billions of headers

PostPosted: Sun Mar 04, 2018 9:24 am
by dexter
It depends on the group, and it isn't always linear. Sometimes a group gets a post flood that concentrates several months' worth of posts into a week or so. But for the most part, your approach will get you in the ballpark... within an order of magnitude. ;)

Re: Estimating storage space reqs for billions of headers

PostPosted: Mon Apr 09, 2018 10:15 am
by raw_toe
When Newsbin is processing downloaded headers, is there a performance impact when downloading from a large group? What I mean is: if a group has 50 million headers and Newsbin starts processing immediately, does the processing slow down because Newsbin is still downloading? If so, would it be faster overall to complete the header download first and then have Newsbin process all the files? Just curious how this part works.

Re: Estimating storage space reqs for billions of headers

PostPosted: Mon Apr 09, 2018 10:43 am
by Quade
Header downloads typically run much faster than import, so the download process isn't being bottlenecked by the processing.

Any time you use the disk for more than one thing at a time, there is a performance impact; I'm just not sure it's significant here. One thing you might want to do is enable Windows disk compression on the import folder inside the data folder. That could reduce disk I/O as well as the space requirements.