6.80 RC4 Header Storage

Posted: Sat Feb 03, 2018 6:36 pm
by Ortiz
The changelog for 6.80 RC4 lists the following item:
Header downloads no longer save the headers as GZ files. Instead they're saved as text. Leaving it up to the user to enable folder compression if they want compression.

This is a bit concerning to me. I have seen Newsbin data folders as large as 130 GB. I cannot imagine how large this would be if the data was not compressed.

"Leaving it up to the user to enable folder compression" sounds like it is a suggestion to enable NTFS compression. I made the mistake a few years ago of enabling NTFS folder compression on a file server that I managed. It crippled performance because NTFS compression results in very bad fragmentation.

https://arstechnica.com/civis/viewtopic ... 0&t=973465

Additionally, NTFS folder compression is not going to compress as well as gzip.

https://stackoverflow.com/questions/328 ... ndows-2012

Please reconsider this decision. Fully uncompressed would be a big waste of space and NTFS folder compression is best avoided.

Re: 6.80 RC4 Header Storage

Posted: Sat Feb 03, 2018 7:21 pm
by Quade
Ortiz wrote:
This is a bit concerning to me. I have seen Newsbin data folders as large as 130 GB. I cannot imagine how large this would be if the data was not compressed.


They're temporary files. Written and then removed after processing.

You're just turning compression on for the folder. Not the whole drive. If you're using an SSD, fragmentation isn't a thing anymore.

Re: 6.80 RC4 Header Storage

Posted: Sun Feb 04, 2018 11:24 am
by Ortiz
Oh. It seems I misunderstood. Which folder are the temporary files typically stored in?

Re: 6.80 RC4 Header Storage

Posted: Sun Feb 04, 2018 11:29 am
by tl
Quade wrote:
Ortiz wrote:
This is a bit concerning to me. I have seen Newsbin data folders as large as 130 GB. I cannot imagine how large this would be if the data was not compressed.

You're just turning compression on for the folder. Not the whole drive. If you're using an SSD, fragmentation isn't a thing anymore.

Even with folder compression enabled, this change will cause a significant increase in (temporary) disk space requirements.

The problem is that folder/file compression is much less efficient than normal compression because it's done independently on each 64 kB block of the file to allow random access, while gzip compresses the whole file, which results in much better compression.
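To illustrate the block-size point, here is a minimal sketch that compresses the same data once as a whole and once as independent 64 kB blocks. It makes several assumptions: zlib stands in for both sides (NTFS uses its own LZNT1 algorithm, not zlib), and the header-like lines are made up rather than real Newsbin data. The per-block total should come out larger, but the size of the gap depends heavily on the data and on the algorithm, so this shows only the direction of the effect, not the ratios measured in this thread.

```python
import random
import zlib

BLOCK = 64 * 1024  # block size taken from the post; zlib is only a stand-in for NTFS's algorithm


def make_headers(n_lines: int, seed: int = 0) -> bytes:
    """Build made-up, header-like text lines (hypothetical format, not real Newsbin data)."""
    rng = random.Random(seed)
    lines = []
    for i in range(n_lines):
        msg_id = rng.getrandbits(64)
        size = rng.randint(100_000, 800_000)
        lines.append(
            f"<{msg_id:016x}@example.invalid>\tSome.Post.Name [{i % 100 + 1}/100]\t"
            f"poster@example.invalid\t{size}\n"
        )
    return "".join(lines).encode("ascii")


def whole_file_size(data: bytes) -> int:
    """Compressed size when the whole file is one stream."""
    return len(zlib.compress(data, 6))


def per_block_size(data: bytes, block: int = BLOCK) -> int:
    """Total compressed size when every block is compressed on its own."""
    return sum(
        len(zlib.compress(data[i:i + block], 6))
        for i in range(0, len(data), block)
    )


data = make_headers(20_000)
whole = whole_file_size(data)
blocks = per_block_size(data)
print(f"raw={len(data)} whole-file={whole} per-block={blocks}")
```

Each independently compressed block has to start without any history from the previous one and carries its own framing, which is the price of being able to read any block without decompressing the rest.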

I grabbed 1.61 GB of gz files from 6.80 RC3 to check how large they were uncompressed (11.9 GB) and what Windows 10 folder compression would bring that down to (4.65 GB). To put it differently: in the case I tested, disabling gzip increases the disk space requirement by a factor of 7.4, and enabling folder compression brings that down to "only" 2.9 times as much disk space. Obviously different groups may compress differently, but I expect these to be reasonable numbers for most cases.

We can thus guesstimate that the 130 GB of gzip'd headers would require ~964 GB uncompressed, or 374 GB with folder compression enabled. This wasn't quite as bad as I had expected, but "near 3x" is definitely noticeable.
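The arithmetic behind those factors and the 130 GB extrapolation can be reproduced in a few lines. This sketch uses only the rounded sizes quoted above; the variable names are made up for illustration. With rounded inputs the estimates land within about 1% of the ~964 GB and 374 GB figures, which is well inside guesstimate territory.

```python
# Sizes quoted in the post, in GB (rounded as posted).
GZ_GB = 1.61           # gzip'd header files from 6.80 RC3
RAW_GB = 11.9          # the same files uncompressed
NTFS_GB = 4.65         # uncompressed files with Windows 10 folder compression
DATA_FOLDER_GB = 130   # the large data folder mentioned at the start of the thread

raw_factor = RAW_GB / GZ_GB    # growth from dropping gzip entirely
ntfs_factor = NTFS_GB / GZ_GB  # growth with folder compression enabled

est_raw_gb = DATA_FOLDER_GB * raw_factor
est_ntfs_gb = DATA_FOLDER_GB * ntfs_factor

print(f"uncompressed: x{raw_factor:.1f}, about {est_raw_gb:.0f} GB")
print(f"folder compression: x{ntfs_factor:.1f}, about {est_ntfs_gb:.0f} GB")
```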

It will also cripple sequential access speed (which is how NBPro accesses it) on physical disks, so it's a big hit unless you store the data folders on an SSD. To be fair, many (though I'm not sure about "most") probably do, but SSD space is far more expensive than physical disk space, so it's not uncommon to have limited space even on an SSD. As a result, using 3x as much space, even temporarily, may well be a serious issue for many.

Re: 6.80 RC4 Header Storage

Posted: Sun Feb 04, 2018 3:42 pm
by dexter
Ortiz wrote:Oh. It seems I misunderstood. Which folder are the temporary files typically stored in?


The header data is stored in the Import folder under the Newsbin Data folder. Each file is removed as it is processed into the header database for the group. There is one header database for each group under the spool_v6 folder. It will only backlog if you have a ton (i.e. hundreds) of groups or you are doing a "download all" on a very high-traffic group. If you are just topping off headers every day, it really shouldn't accumulate much data. The number of header data files waiting to be imported is reflected in the cache display at the bottom of the Newsbin window. So if it says "Cache 400/400(10)", there will be 10 data files sitting in the Import folder.
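For anyone who wants to log that backlog figure, here is a small sketch that pulls the number in parentheses out of a cache string. It assumes the display always has the shape of the one example above; the function name and pattern are made up for illustration and are not part of Newsbin.

```python
import re
from typing import Optional

# Pattern inferred from the single example "Cache 400/400(10)"; other shapes are not handled.
_PENDING = re.compile(r"Cache\s+\d+/\d+\((\d+)\)")


def pending_import_files(status: str) -> Optional[int]:
    """Return the count in parentheses, or None if the text does not match."""
    match = _PENDING.search(status)
    return int(match.group(1)) if match else None


print(pending_import_files("Cache 400/400(10)"))
```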

Re: 6.80 RC4 Header Storage

Posted: Tue Feb 06, 2018 12:29 am
by Ortiz
dexter wrote:
The header data is stored in the Import folder under the Newsbin Data folder...

Cool. Thanks. Is there some sort of advantage to having these files be uncompressed vs gzip? Does it assist with performance or something?

Re: 6.80 RC4 Header Storage

Posted: Tue Feb 06, 2018 12:39 am
by dexter
Yes, we found a performance improvement by not compressing/decompressing all the time. Also it helps avoid conflicts with antivirus software trying to scan the binary data.