We’ve got our Graylog Enterprise instance set up with the AWS Elasticsearch service as the index backend and an S3 bucket for archive storage (indices rotated once a day). This unfortunately means that when an archive runs, the data has to be downloaded from the AWS Elasticsearch service, exported and compressed, and then uploaded back to AWS (the S3 bucket). This works relatively well, except that the bandwidth used exceeds our contracted rate for a couple of hours, which results in significant overage charges.
To remedy that, I’ve throttled the bandwidth on the log server with wondershaper/tc, which succeeds in limiting the bandwidth. However, every archive I’ve run since the change has failed to complete. It writes only one segment and then quits. I see this in the server log:
2018-02-05T13:48:14.673-06:00 INFO [RollingFileSegmentOutputStream] Creating new segment: /opt/s3/graylog-archives/graylog_39-20180205-163344-372/archive-segment-1.gz
2018-02-05T13:48:25.581-06:00 ERROR [ArchiveCreateJob] Archived only 4421000 out of 8508322 documents, not deleting/closing index graylog_39
2018-02-05T13:48:25.593-06:00 INFO [SystemJobManager] SystemJob <55148030-0a92-11e8-91ec-fee5de21aa98> [org.graylog.plugins.archive.job.ArchiveCreateSystemJob] finished in 11681221ms.
Could you please be a bit more verbose about your setup? What versions did you use? How did you configure them? How much did you throttle, and where exactly did you throttle which kind of connection?
With only the above information, we are currently not able to give any help.
I’m running Graylog 2.4.3 with the same version of the Enterprise plugins. It’s running on CentOS 7.3 and I’m using wondershaper 1.3 with the following config:
[wondershaper]
# Adapter
#
IFACE="ens160"
# Download rate in Kbps
#
DSPEED="18432"
# Upload rate in Kbps
#
USPEED="18432"
That will limit the up and down throughput for the ens160 adapter (the only network adapter in the system) to 18Mbps (our contracted rate is 20Mbps with bursts allowed to 40Mbps).
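A quick way to confirm the shaping is actually in effect looks roughly like this (this assumes the systemd-packaged wondershaper, which is what the config format above belongs to, so treat it as a sketch):

# Restart the shaper so the config above is (re)applied, then inspect the
# qdisc/classes it installs on ens160; rates are in kbit (18432 kbit ≈ 18 Mbps).
sudo systemctl restart wondershaper
tc -s qdisc show dev ens160
tc class show dev ens160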
The Graylog server uses the AWS Elasticsearch service (three r4.xlarge.elasticsearch instances) connected over a VPC. Indices are rotated daily and each index is somewhere between 8 and 15 GB in size (usually closer to 12-13 GB), comprising 6-10M messages. 4 shards per index, 0 index replicas, 1 ES segment per index. We keep 30 days’ worth of indices (though it’s set to 45 now while we’re working on this problem) and indices are deleted after they’re archived. The archives are saved to an S3 bucket via a fuse.s3fs mount. Archive max segment size is 500M, gzip compression, CRC32 checksum.
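For reference, the mount is something along these lines; the bucket name and the credential/cache options below are placeholders rather than our exact setup:

# Placeholder s3fs mount matching the /opt/s3 path seen in the log above;
# bucket name and options are examples only.
s3fs graylog-archives-bucket /opt/s3 \
  -o passwd_file=/etc/passwd-s3fs \
  -o use_cache=/var/cache/s3fs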
That does not look like a bug; because of the technical way the archiving works, this can happen by design when the storage is slow, as it is in your setup.
If you are not able to extend the resources and need additional professional services to help with your use case, please get in contact with the Graylog company.
I don’t see how this is not a bug. The failures happen after one segment is written, so it’s clearly fast enough to be able to write one of the files. And 18Mbps is not that slow. Is there not a way to increase the logging level so we can see what’s happening before the ERROR entry?
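If the logger level can be raised over the REST API, I’d try something along these lines; the endpoint path and logger name below are my guesses, not something I’ve confirmed:

# Guess at bumping the archive plugin's logger to debug via the Graylog REST
# API; the endpoint path and logger name are assumptions, not confirmed.
curl -u admin -X PUT \
  "http://graylog.example.com:9000/api/system/loggers/org.graylog.plugins.archive/level/debug"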
Does the same error occur when you write to the local/ephemeral disk of the EC2 instance?
If yes, I’d consider it a bug. If not, I’d recommend not throttling your network interface so much that it stifles the backup process by effectively reducing the write performance (to S3) to a crawl.
It’s shared between the connection to the AWS Elasticsearch service and S3, correct? So it effectively halves the available bandwidth, and that’s without counting any overhead and only under optimal conditions.
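One quick way to see how much the throttle hurts the S3 mount would be a rough write test like the following; the paths are just examples, and the 500M count matches your archive segment size:

# Compare raw write throughput to local disk vs. the s3fs mount while the
# throttle is active; dd prints the achieved rate when it finishes.
dd if=/dev/zero of=/var/tmp/segment-test bs=1M count=500 conv=fsync
dd if=/dev/zero of=/opt/s3/graylog-archives/segment-test bs=1M count=500 conv=fsync
rm -f /var/tmp/segment-test /opt/s3/graylog-archives/segment-test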
Archive to S3 with throttling set to 18Mbps down/40Mbps up - fails
Manually copy the archive directory to the S3 mount with 18/18 throttle - succeeds
The latter two say to me that there’s no issue with copying files to the S3 mount with the throttle active. However, I think I may have an idea for how to work around this issue. I can set the destination for archives to a local directory and then move them to the S3 mount. Of course, Graylog will then no longer be able to find the archive. So, is there a way to edit the “segment directory” value for a particular archive?
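Something like the following is what I have in mind, assuming the archive output is pointed at a local staging directory first; the local path and the bandwidth cap are just examples:

# Move finished archives from a hypothetical local staging directory to the
# S3 mount, capping rsync's own bandwidth (value in KB/s) so the transfer
# doesn't compete with Elasticsearch traffic.
rsync -a --bwlimit=2000 --remove-source-files \
  /var/lib/graylog-archive-staging/ /opt/s3/graylog-archives/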