Estimating storage space reqs for billions of headers


Postby br0adband » Thu Mar 01, 2018 8:35 pm

Ok so, I've never had any issues with Newsbin over the many years I've been using it. It's one of the best purchases I've ever made (and I've been using computers since the mid-1970s) and an amazing piece of work, so thanks, Quade. ;)

Anyway, I'm considering grabbing a few billion headers from several groups, and I'm wondering if it's possible to get some estimate of how much storage space that might require. One group has just under 6 billion posts. I know what I'm doing is kinda ridiculous nowadays with so many NZB search engines out there, but I've always had fun just loading a crapload of headers and then browsing for stuff the good old-fashioned way. I've discovered a lot of things that way that I would never have found otherwise, because it never crossed my mind to search for those topics.

Is there any way to get a rough estimate of the necessary space for that many headers from a Usenet newsgroup, or has anyone else around here who uses Newsbin ever let it grab quite so many?

Thanks again for the great app and the support over the many years.

Have fun, always...
bb
The difference between genius and stupidity?
Genius has limits.
br0adband
Occasional Contributor
Posts: 46
Joined: Fri Sep 16, 2005 11:02 pm
Location: Springfield MO

Registered Newsbin User since: 09/03/05

Re: Estimating storage space reqs for billions of headers

Postby itimpi » Thu Mar 01, 2018 9:21 pm

I seem to remember it being something like 300 bytes per header, which would mean you would need something like 1.8 TB for that many headers. Even if I am off slightly, that gives you an idea of the scale!
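For scale, the arithmetic can be sketched like this (the 300-bytes-per-header figure is a from-memory guess, and real header sizes vary with subject-line length):

```python
# Back-of-the-envelope storage estimate for raw headers.
# Assumes ~300 bytes per header (a rough figure; actual sizes
# vary with subject-line length and per-header overhead).
BYTES_PER_HEADER = 300
HEADERS = 6_000_000_000  # just under 6 billion in the largest group

total_bytes = HEADERS * BYTES_PER_HEADER
total_tb = total_bytes / 1e12  # decimal terabytes

print(f"{total_tb:.1f} TB")  # prints "1.8 TB"
```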
The Newsbin Online documentation
The Usenettools for tutorials, useful information and links
itimpi
Elite NewsBin User
Posts: 12604
Joined: Sat Mar 16, 2002 7:11 am
Location: UK

Registered Newsbin User since: 03/28/03

Re: Estimating storage space reqs for billions of headers

Postby br0adband » Thu Mar 01, 2018 9:43 pm

I guess the main reason I asked is that I don't know if Newsbin Pro is doing any compression of the database files. I'm guessing there's some going on, because in the past I moved the data files for an OS re-installation and tried to compress them with WinRAR, and there wasn't much actual compression in the resulting archive.

But if it's ~300 bytes per header, then yeah, it'll be a huge amount of space, and I don't have that kind of raw storage at present: my laptop has a 250GB drive in it, and I have a 640GB drive attached to the eSATA port on the dock. So there goes that idea, I suppose.

Thanks for the response.
br0adband
Occasional Contributor
Posts: 46
Joined: Fri Sep 16, 2005 11:02 pm
Location: Springfield MO

Registered Newsbin User since: 09/03/05

Re: Estimating storage space reqs for billions of headers

Postby Quade » Thu Mar 01, 2018 11:40 pm

The headers that end up in the DB3 are compressed. Depending on what version you're using, the headers downloaded but not yet put into the database might or might not be compressed.

It's a significant amount of space either way, but I think with 100 gigs free you'd have more than enough. My whole data folder is something like 130 gigs.
Quade
Eternal n00b
Posts: 44867
Joined: Sat May 19, 2001 12:41 am
Location: Virginia, US

Registered Newsbin User since: 10/24/97

Re: Estimating storage space reqs for billions of headers

Postby br0adband » Fri Mar 02, 2018 11:20 pm

Then I might just have a go at it. Is the header size basically uniform? I know some subject fields are longer than others, so I presume there's some play in the sizes. I was going to grab 1,000 days of headers - I use Newsdemon, which claims 3486+ days of retention now - and then extrapolate from that. Say it turns out to be 25GB: for the full pull I could expect close to 90GB (25GB × 3 for 3,000 days, plus roughly 12.5GB for the remaining 486 days, for a grand total of 87-90GB).
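That extrapolation can be sketched in a few lines, assuming header volume scales roughly linearly with days of retention (the 25GB sample figure is the hypothetical from the post, not a measurement):

```python
# Linear extrapolation from a partial header download to full retention.
# Assumes header volume per day is roughly constant, which is only
# approximate -- post floods can skew individual groups.
sample_days = 1_000       # days of headers actually downloaded
sample_gb = 25.0          # hypothetical measured size for those days
retention_days = 3_486    # provider's claimed retention

estimated_gb = sample_gb * retention_days / sample_days
print(f"~{estimated_gb:.0f} GB for the full pull")  # ~87 GB
```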

Am I just thinking too much about this or what? :D
br0adband
Occasional Contributor
Posts: 46
Joined: Fri Sep 16, 2005 11:02 pm
Location: Springfield MO

Registered Newsbin User since: 09/03/05

Re: Estimating storage space reqs for billions of headers

Postby dexter » Sun Mar 04, 2018 9:24 am

It depends on the group, and it isn't always linear. Sometimes groups have a post flood that concentrates several months' worth of posts into a week or so. But for the most part, your approach will get you in the ballpark... within an order of magnitude. ;)
dexter
Site Admin
Posts: 9511
Joined: Fri May 18, 2001 3:50 pm
Location: Northern Virginia, US

Registered Newsbin User since: 10/24/97

Re: Estimating storage space reqs for billions of headers

Postby raw_toe » Mon Apr 09, 2018 10:15 am

When Newsbin is processing downloaded headers, is there a performance impact when downloading from a large group? What I mean is, if a group has 50 million headers and Newsbin starts processing immediately, does the processing slow down because Newsbin is still downloading? If so, would it be faster overall to complete the header download and then have Newsbin process all the files? Just curious how this part works.
raw_toe
Occasional Contributor
Posts: 17
Joined: Mon Aug 09, 2010 5:57 pm

Registered Newsbin User since: 05/06/03

Re: Estimating storage space reqs for billions of headers

Postby Quade » Mon Apr 09, 2018 10:43 am

Header downloads typically run much faster than import, so the download process isn't being bottlenecked by the processing.

Any time you use the disk for more than one thing at a time there is a performance impact; I'm just not sure it's significant here. One thing you might want to do is enable Windows disk compression on the import folder inside the data folder. That could reduce disk I/O and reduce the space requirements.
Quade
Eternal n00b
Posts: 44867
Joined: Sat May 19, 2001 12:41 am
Location: Virginia, US

Registered Newsbin User since: 10/24/97

