Best Practice: Elasticsearch/OpenSearch?


1. Describe your incident:

2. Describe your environment:

  • OS Information:
    OpenSearch 2.0.1
    Server OS: Ubuntu 22.04 in LXC guest (16 core 64GB); LXC host: Debian Bookworm (16 core 64GB)
    Graylog: 5.0.3+a82acb2 (open/community edition)

  • Service logs, configurations, and environment variables:

# Path to a custom java executable. By default the java executable of the
# bundled JVM is used.
#JAVA=/usr/bin/java

# Default Java options for heap and garbage collection.
GRAYLOG_SERVER_JAVA_OPTS="-Xms31g -Xmx31g -server -XX:+UseG1GC -XX:-OmitStackTraceInFastThrow"

# Avoid endless loop with some TLSv1.3 implementations.
GRAYLOG_SERVER_JAVA_OPTS="$GRAYLOG_SERVER_JAVA_OPTS -Djdk.tls.acknowledgeCloseNotify=true"

# Fix for log4j CVE-2021-44228
GRAYLOG_SERVER_JAVA_OPTS="$GRAYLOG_SERVER_JAVA_OPTS -Dlog4j2.formatMsgNoLookups=true"

# Pass some extra args to graylog-server. (i.e. "-d" to enable debug mode)
GRAYLOG_SERVER_ARGS=""

# Program that will be used to wrap the graylog-server command. Useful to
# support programs like authbind.
GRAYLOG_COMMAND_WRAPPER=""

Indices:

80 total shards

Outgoing traffic is between 16 and 18 GB daily

Indexing failures are through the roof, mostly because of:
a) OpenSearchException[OpenSearch exception [type=mapper_parsing_exception, reason=failed to parse field [ListBaseType] of type [long] in document with id 'c0b9fbc0-c8c5-11ed-895a-00163ef2bcdd'. Preview of field's value: 'GenericList']]; nested: OpenSearchException[OpenSearch exception [type=illegal_argument_exception, reason=For input string: "GenericList"]];
or
b) OpenSearchException[OpenSearch exception [type=illegal_argument_exception, reason=Limit of total fields [1000] has been exceeded]]

3. What steps have you already taken to try and solve the problem?
Just MacGyver stuff: panicked heap increases; increased the total field limit to 2000; forced restarts.
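For reference, the field-limit bump is a one-liner against the OpenSearch API. A sketch, assuming the default localhost:9200 endpoint; graylog_0 is a placeholder index name:

# Raise the total-fields limit on one existing index.
# Note: new indices created on rotation fall back to the template default,
# so this has to be repeated (or baked into the index template).
curl -s -X PUT "localhost:9200/graylog_0/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index.mapping.total_fields.limit": 2000 }'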

4. How can the community help?
Basically I am just looking for suggestions on how to improve my situation. Issue a), above, has been commented on before. I just haven’t gotten my hands dirty yet.

I think what I need is some guidance as to best practice for indices / shard management.

We are a single-node, all-in-one setup (OpenSearch/Mongo/Graylog all on one machine). I have access to heavier-hitting hardware, but I have to think better management would keep this instance tip-top.

All my knowledge with respect to database management has come by way of Graylog over the past year and a half, so I am still but a child in this area.

Thank you!


Hey @accidentaladmin

I’ll throw my 2 cents in.

From what you’re showing, it looks good to me @accidentaladmin. Here is a brief look at mine; I tried to get it all in the pic.
Inputs for different devices.

Each one of those Inputs has a stream attached that goes into its own index, which I see you have done also with your “Linux” index. What it looks like you’re having trouble with is the index template using dynamic mapping. This means, if you don’t already know, ES/OS takes the first value it sees for a field like ListBaseType and auto-configures the type from it; here it got mapped as a [long], so when another message/log arrives with something else in ListBaseType (i.e., a string like “GenericList”) it starts to bitch. Not sure if you have an Extractor or Pipeline creating that field called ListBaseType or if you let OpenSearch do it for you.
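One way out, if that field should always be text, is a custom index mapping that pins the type before dynamic mapping guesses it. A sketch using the legacy template API, assuming OpenSearch answers on localhost:9200 and your indices use the default graylog_ prefix; it only applies to indices created after the next rotation:

# Pin ListBaseType to keyword for every future graylog_* index.
curl -s -X PUT "localhost:9200/_template/graylog-custom-mapping" \
  -H 'Content-Type: application/json' \
  -d '{
    "template": "graylog_*",
    "mappings": {
      "properties": {
        "ListBaseType": { "type": "keyword" }
      }
    }
  }'

After the template is in place, manually rotate the active write index so new documents land in an index that uses it.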

As for your shards,

20 + 20 + 12 + 12 + 20 = 84 indices * 4 shards each = 336 shards total

The rule of thumb is roughly 20 shards per GB of heap: a node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.
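If you want to double-check where you stand, the _cat and health APIs give quick numbers (assuming the node answers on localhost:9200):

# One line per shard, so this is the total shard count.
curl -s "localhost:9200/_cat/shards" | wc -l

# Or read active_shards straight out of cluster health.
curl -s "localhost:9200/_cluster/health?pretty"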

So you’re pretty good there.

I have had this happen to me. Here is the key: don’t let OpenSearch run wild, and try to keep a limit on what’s being sent to your Graylog server, i.e., from Winlogbeat, Filebeat, Nxlog, etc… Try to send only what you need. Divide each set of devices into their own index set as much as possible. Sometimes databases/DNS servers, firewalls, and switches may be chatty. When you’re over the 1000-field mark you may have to wait until the index rotation is completed.

For example, when I had this happen we had a domain controller (Hyper-V virtual machine) sending 50 million logs per day. The issue was “the disk retry warning count was steady around 200”, which means Hyper-V could not connect to the virtual disk in storage. The fix was to shut down the machine, wait 10 seconds, and start it back up. “Thanks, Windows” :laughing:

I had a shit ton of errors and logs which created a mapping explosion in my environment.

“ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Limit of total fields [1000] has been exceeded]]”

Not only did I get an alert from Graylog, but I checked in real time:

[root@graylog opensearch]# curl -s -XGET es_host_ip:9200/graylog_1899/_mapping?pretty | grep type | wc -l
2143
[root@graylog opensearch]#
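One caveat with that grep: any field whose name contains “type” (like ListBaseType) also matches, so the count can run high. If jq is installed, counting the top-level properties is a bit tighter; since Graylog flattens fields, top-level is normally all of them (the index name is just the example from above):

# Count the fields actually defined in the mapping.
curl -s "localhost:9200/graylog_1899/_mapping" \
  | jq '.graylog_1899.mappings.properties | length'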

I had two choices: either remove the older index sets, or wait for them to cycle out.

That’s all I have for ya.


This is awesome advice; thank you! I will put it through its paces :slight_smile:

Okay, so I followed your suggestion and broke my indices up:

So hopefully that clears the “1000 total fields” exception.

In doing the above, I was finally able to track down the stinker that’s causing the “ListBaseType” error:

This index is populated entirely by the Office365 & AzureAD collector found here @ddbnl

This is the extractor used for that input:

{
  "extractors": [
    {
      "title": "Audit Log Extractor",
      "extractor_type": "json",
      "converters": [],
      "order": 0,
      "cursor_strategy": "copy",
      "source_field": "message",
      "target_field": "",
      "extractor_config": {
        "flatten": true,
        "list_separator": ", ",
        "kv_separator": "=",
        "key_prefix": "",
        "key_separator": "_",
        "replace_key_whitespace": false,
        "key_whitespace_replacement": "_"
      },
      "condition_type": "none",
      "condition_value": ""
    }
  ],
  "version": "4.2.9"
}

Am I correct in thinking the answer lies somewhere in here? If so, honestly, I am not sure what to do to resolve it. I assume “list_separator” may have a part to play?

Thank you!

hey @accidentaladmin

sorry for the slow reply, I went on va-ca

Awesome :+1:

TBH, I really haven’t worked with JSON extractors, BUT if you can limit what’s being sent, that would be one way. Or perhaps try to drop messages before they’re indexed, i.e., with a Pipeline.

Ya, most likely that flatten JSON is the issue. The config is correct, but what you are basically doing is saying: for anything that MS sends as a JSON “key”, create a field. And how many different fields will they throw at you… probably a lot.

And then every log type probably has its own whole new set of fields, etc.

If you did this in pipeline rules you could have more control over what you keep/drop. Also, in a pipeline you could route each different log type from MS to a different stream/index so that they don’t go over the 1k limit. Hopefully I explained that well :slight_smile:
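To make that concrete, here is a rough sketch of what such a rule could look like. This is not a drop-in rule: the JSON paths and the stream name are placeholders, and it assumes the raw JSON still arrives in the message field with the extractor disabled:

rule "O365: keep selected fields and route"
when
    has_field("message")
then
    // Parse the raw JSON ourselves instead of flattening every key.
    let json = parse_json(to_string($message.message));
    // Pull out only the fields we actually search on (paths are examples).
    let wanted = select_jsonpath(json, {
        Operation: "$.Operation",
        UserId:    "$.UserId",
        Workload:  "$.Workload"
    });
    set_fields(wanted);
    // Route to a dedicated stream/index (placeholder name) so one noisy
    // log type can't blow the field limit for everything else.
    route_to_stream(name: "O365 SharePoint");
end

In practice you’d write one rule per workload (or gate the route in the when clause) so each MS log type lands in its own index set.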


Unfortunately, that did not make a whole lot of sense to me, but that is due to my own ignorance on the subject haha

But it seems like your solution would address the >1000 fields issue (which I believe I solved by creating a dedicated index for the O365 plug-in). Would your solution address the:

OpenSearchException[OpenSearch exception [type=mapper_parsing_exception, reason=failed to parse field [ListBaseType] of type [long] in document with id 'c0b9fbc0-c8c5-11ed-895a-00163ef2bcdd'. Preview of field's value: 'GenericList']]; nested: OpenSearchException[OpenSearch exception [type=illegal_argument_exception, reason=For input string: "GenericList"]];

issue?

Thank you!

How dare you!
:wink:

Hope you had a restful vacation!


So from that error message it looks like it’s trying to send string text to a numeric field type. What would be the expected values of that field? Because somehow it seems string and numerical values are being sent to the same field.
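A quick way to confirm what type the field got locked to is the field-mapping endpoint (assuming localhost:9200; the wildcard covers all Graylog-managed indices):

# Show how ListBaseType is currently mapped in each index.
curl -s "localhost:9200/graylog_*/_mapping/field/ListBaseType?pretty"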

I quite agree, but I have no idea how to solve the issue.

Thank you, I had a blast. It was good to get away.


Do you have any messages that have that field in them, and what values are in that field?

You can use _exists_:fieldname in the search to find those messages (e.g., _exists_:ListBaseType).

