Exporting a dataset to a CSV file with the Graylog API is very slow in version 2.4.6 (Docker cluster)


(Belhadi Rachid) #1

Hi everyone

When I export a dataset in CSV format in production with the Graylog 2.4.6 API (Docker cluster), I get a throughput of about 5 KB/s. The total volume of this dataset is about 5 GB. On another machine running Graylog 2.0.1 (standalone install), with the same query and the same data size, I get about 4000 KB/s. Is there a configuration setting I can change to solve the problem on version 2.4.6?

thank you all !!!
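
To put the two figures above in perspective, here is a rough back-of-the-envelope estimate of how long a 5 GB export would take at each rate. This is only a sketch: it assumes 1 GB = 1024 * 1024 KB (the post does not say which convention applies), and the helper name `export_seconds` is made up for illustration.

```python
# Rough export-time estimate from the figures quoted in the post.
# Assumption: 1 GB = 1024 * 1024 KB (binary convention).
KB_PER_GB = 1024 * 1024


def export_seconds(size_gb: float, rate_kb_per_s: float) -> float:
    """Return the seconds needed to transfer size_gb at rate_kb_per_s."""
    return size_gb * KB_PER_GB / rate_kb_per_s


slow = export_seconds(5, 5)      # Graylog 2.4.6 (Docker cluster), 5 KB/s
fast = export_seconds(5, 4000)   # Graylog 2.0.1 (standalone), 4000 KB/s

print(f"2.4.6: {slow / 86400:.1f} days")     # 2.4.6: 12.1 days
print(f"2.0.1: {fast / 60:.1f} minutes")     # 2.0.1: 21.8 minutes
print(f"ratio: {slow / fast:.0f}x")          # ratio: 800x
```

Under these assumptions the same export goes from roughly 22 minutes to roughly 12 days.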


(Philipp Ruland) #2

Hey @keaoner,

might this be your problem:

Greetings,
Philipp


(Belhadi Rachid) #3

thank you for the answer

I saw this post on the forum, but my Graylog installation runs under Docker, so /etc/default/graylog-server does not exist. The ps aux command shows the following launch parameters:

/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms4g -Xmx8000m -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow -jar -Dlog4j.configurationFile=/usr/share/graylog/data/config/log4j2.xml -Djava.library.path=/usr/share/graylog/lib/sigar/ -Dgraylog2.installation_source=docker /usr/share/graylog/graylog.jar server -f /usr/share/graylog/data/config/graylog.conf

Is the -Djavax.net.debug=all parameter loaded by default in version 2.4.6 of Graylog? How can I disable it in Docker?


(Philipp Ruland) #4

AFAIK this is not loaded by default. And since it is not included in your ps aux output, it is also not enabled in your environment. Could you make another export to CSV, then get the Graylog logs shortly after that and post them here? Have a look through them first to see if you find something relevant; you might have to change the logging level of Graylog in the System configuration menu first for it to be more verbose :slight_smile:

Greetings,
Philipp
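
The check described in this post (is a given JVM system property present in the process command line?) can be sketched as below. This is only an illustration: `CMDLINE` is an abbreviated copy of the command line posted above, and `system_properties` is a hypothetical helper, not part of Graylog.

```python
import shlex

# Abbreviated copy of the command line reported by `ps aux` in the thread.
CMDLINE = (
    "/usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -Xms4g -Xmx8000m "
    "-XX:NewRatio=1 -server -XX:+UseConcMarkSweepGC "
    "-Dgraylog2.installation_source=docker "
    "/usr/share/graylog/graylog.jar server "
    "-f /usr/share/graylog/data/config/graylog.conf"
)


def system_properties(cmdline: str) -> dict:
    """Collect -Dname=value JVM system properties from a command line."""
    props = {}
    for token in shlex.split(cmdline):
        if token.startswith("-D"):
            # Split "-Dname=value" into name and value at the first "=".
            name, _, value = token[2:].partition("=")
            props[name] = value
    return props


props = system_properties(CMDLINE)
print("javax.net.debug" in props)             # False
print(props["graylog2.installation_source"])  # docker
```

If `javax.net.debug` does not show up among the collected properties, the flag was not passed on the command line.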


(Belhadi Rachid) #5

Hi Philipp,

Here is the log in debug mode. Thank you for your help.

https://drive.google.com/file/d/1_3OIPDOGOtlW9-1kLayJXNaxmhauFkCR/view?usp=sharing


(Philipp Ruland) #6

Heyo :slight_smile:

Unrelated issues that you should fix anyway:

2018-09-20 06:36:00,724 WARN : org.graylog2.inputs.codecs.GelfCodec - GELF message <705fd23f-bc9f-11e8-b49a-d6e55e6cc41f> is missing mandatory "host" field.

Over 106 thousand occurrences… :smiley:

2018-09-20 09:37:28,840 ERROR: org.graylog2.inputs.converters.CsvConverter - Different number of columns in CSV data (22) and configured field names (20). Discarding input.

59 matches. You're losing some of your logs with this :slight_smile:

Do you know at what time you started the CSV export? Otherwise this will be like searching for a needle in a haystack…

Greetings,
Philipp
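
Counts like the ones above can be obtained by tallying WARN and ERROR lines per logger. The following is a minimal sketch that assumes every log line has the same shape as the two lines quoted in this post; the third sample line and the helper name `count_by_logger` are made up for illustration.

```python
import re
from collections import Counter

# Matches lines shaped like "<date> <time> <LEVEL>: <logger> - <message>".
# The optional whitespace before the colon covers both "WARN :" and "ERROR:".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+)\s*: (?P<logger>\S+) - (?P<msg>.*)$"
)


def count_by_logger(lines):
    """Count WARN and ERROR lines per (level, logger) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") in ("WARN", "ERROR"):
            counts[(match.group("level"), match.group("logger"))] += 1
    return counts


sample = [
    '2018-09-20 06:36:00,724 WARN : org.graylog2.inputs.codecs.GelfCodec - '
    'GELF message <705fd23f-bc9f-11e8-b49a-d6e55e6cc41f> is missing mandatory "host" field.',
    '2018-09-20 09:37:28,840 ERROR: org.graylog2.inputs.converters.CsvConverter - '
    'Different number of columns in CSV data (22) and configured field names (20). Discarding input.',
    # Hypothetical INFO line, added only to show that other levels are ignored.
    '2018-09-20 09:37:29,001 INFO : org.example.Hypothetical - made-up line',
]

for (level, logger), n in count_by_logger(sample).items():
    print(level, logger, n)
```

On a real log file, the same function can be fed the open file object directly, since it only iterates over lines.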


(Philipp Ruland) #7

Heyo @keaoner,

I just stumbled across this:

how’s the performance when querying Graylog itself? :slight_smile:

Greetings,
Philipp


(Belhadi Rachid) #9

Hi Philipp,

I will redo a CSV download test and provide you with the logs from the beginning.

I am able to download the CSV; my problem is the download speed.

Thank you very much. Performance is good when running a query in the web interface; however, downloading the result as CSV is very slow (5 KB/s).

Best regards!


(Belhadi Rachid) #10

Hi Philipp,

Here is my new debug log. I started the CSV file download on 2018-09-26 at 15:06.

thanks again for your help

https://drive.google.com/file/d/10bVKzr1tf64bGxfi2QfuP0xqgL4MNkJi/view?usp=sharing


(Philipp Ruland) #11

Heyo :slight_smile:

What’s your timezone? The logs end at 14:03 :smiley: (They’re in UTC, so… :smiley:)

Greetings,
Philipp


(Belhadi Rachid) #12

Hi Philipp

My timezone is UTC+2

Thank you
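
For reference, the conversion between the local download time and the UTC timestamps in the logs can be worked out as below. This is a small sketch that assumes the logs end on the same day as the download (2018-09-26); the thread only gives the time 14:03.

```python
from datetime import datetime, timedelta, timezone

# UTC+2, as stated in the post.
LOCAL = timezone(timedelta(hours=2))

# CSV download started on 2018-09-26 at 15:06 local time.
start_local = datetime(2018, 9, 26, 15, 6, tzinfo=LOCAL)
start_utc = start_local.astimezone(timezone.utc)

# Assumption: the logs end at 14:03 UTC on the same day.
log_end_utc = datetime(2018, 9, 26, 14, 3, tzinfo=timezone.utc)

print(start_utc.isoformat())    # 2018-09-26T13:06:00+00:00
print(log_end_utc - start_utc)  # 0:57:00
```

Under that assumption the download started at 13:06 UTC, so the logs would cover about the first 57 minutes of it.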


(Belhadi Rachid) #13

Hi Philipp,

additional information:

The production platform is composed of Graylog 2.4.6 + Amazon Elasticsearch 5.6: the CSV download does not exceed 10 KB/s (with a cluster of 2 Elasticsearch instances).

The POC platform is composed of Graylog 2.0.2 + Elasticsearch 2.3 (official Graylog 2.0.2 AMI): the CSV download reaches up to 4000 KB/s (with one Elasticsearch instance).

The query we send contains many wildcards (example: AND NOT (* login * * ident * …))

The index contains 20,000,000 documents, and its size varies from 12 GB to 23 GB.

we have to keep the default values when creating the index-set

On the production platform, Elasticsearch consumes a lot of CPU (between 70% and 100%) while the CSV is downloading.

Thank you very much


(Jan Doberstein) #14

@keaoner sad to say - but yes that happens. Let me tell you why.

In 2.0.2, Graylog was part of the Elasticsearch cluster as a no-data, no-master node, speaking the binary protocol with Elasticsearch like all other nodes.
Because of some decisions made by Elastic, Graylog was forced to move to the HTTP REST interface. Graylog, like any other solution that uses Elasticsearch, now has to speak HTTP REST to Elasticsearch, which adds a lot of overhead and requires the server to do more processing before it sends out the answer.

Graylog will try to get more speed out of it, but that is nothing we can squeeze out in minutes. In addition, we are in the hands of Elastic on this topic because they do not provide a solid, stable client for Elasticsearch (which they promised to the world …).

This is not an excuse, just an explanation of the problem.


(Belhadi Rachid) #15

Hi Jan and Philipp

Thank you very much for your help.
@jan: The explanation is clear. It's a real shame; we'll try to find a solution internally.
Are there any plans to improve CSV download performance?

Just for your information: with Graylog we have developed a user-based search term recommendation system, a statistical spelling checker based on our users, and a document recommendation system based on our users’ usage. The POC works well, and we are moving into production.


(Jan Doberstein) #16

please see this graylog bug issue: https://github.com/Graylog2/graylog2-server/issues/5172

