Too many fields

We’ve got an index that has 1008 fields. I’ve tried separating out some of the various logs (all are Windows Application and Services Logs) into separate indices, but I still have 1008 fields, so I’ve obviously picked logs that share fields with other logs.

How can I see a breakdown of field names/number of logs/entries per field so I can identify busy/unusual field names, then start targeting the logs with those field names to split them out into their own indices? I’ve tried various Elasticsearch commands but don’t seem to be able to get this data (I’ve checked their documentation and can’t find anything that does what I need).

There are a hundred or so Windows Application and Services Logs, so I can’t create an index for each.

To summarise: I know what the issue is and I know how to solve it, but I need some help identifying which fields I actually have in Elasticsearch and how “busy” those fields are so I can put the solution in place (split the logs into appropriate indices).

This is in Graylog 3.3.16 with associated Elasticsearch version.

As far as I can find, there aren’t any Elasticsearch APIs that can answer this* (*they later added some, but these are not available in Elasticsearch 7.10 or older).

I understand what you are asking, but if your desired outcome is to keep your field count below 1024, my recommendation is to separate each log source type into its own stream/index set. For example, logs collected from Windows would get a dedicated stream/index set.

Hope that helps.

Hi @drewmiranda-gl

The limit in Graylog is 1000 fields per index. So we need to stay under that.

These logs are all Windows logs; they have already been separated once from other Windows logs, and we still have over 1000 fields.

Having 100 indices would result in real scaling issues when it comes to managing that number of indices and shards in both Graylog and Elasticsearch.

I could randomly split some of the logs out by their log names, etc., but it would be pure chance whether I split out the logs that make the biggest difference to the field count.

All our dashboards and all our alerts would have to be rewritten to search for logs across multiple indices.

This seems like a real scaling issue in Graylog. It doesn’t look like there is anything to help manage this in Graylog 5 either; there is no automation of index creation based on Beats configuration or similar.

This gets a list of the fields in an index, but to actually be useful I could really do with knowing the number of times each field appears in the index:

index/_field_caps?fields=*
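
For example, calling it from Python and listing the field names looks roughly like this (host and index name are placeholders for our setup):

import requests

ES = "http://localhost:9200"   # placeholder: your Elasticsearch node
INDEX = "graylog_0"            # placeholder: one of the Graylog-managed indices

# _field_caps returns one entry per field present in the index mapping
resp = requests.get(f"{ES}/{INDEX}/_field_caps", params={"fields": "*"})
resp.raise_for_status()
fields = resp.json()["fields"]

print(f"{len(fields)} fields in {INDEX}")
for name in sorted(fields):
    print(name)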

This is easy in SQL; surely there must be a way to do something similar in ES?

This is an Elasticsearch soft limit and is meant to protect against what they call “mapping explosion”, “which can cause out of memory errors and difficult situations to recover from”.

This limit can be increased, but it is not recommended.
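
For reference, the setting behind this is index.mapping.total_fields.limit (default 1000). It can be raised per index, but as noted above that only postpones the problem. A minimal sketch, assuming direct access to a local Elasticsearch node and a placeholder index name:

import requests

ES = "http://localhost:9200"   # assumption: local Elasticsearch node
INDEX = "graylog_0"            # placeholder index name

# Raise the per-index field limit above the default of 1000.
# This buys headroom only; it does not address the underlying mapping growth.
resp = requests.put(
    f"{ES}/{INDEX}/_settings",
    json={"index.mapping.total_fields.limit": 2000},
)
resp.raise_for_status()
print(resp.json())   # expect {'acknowledged': True}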

This blog post explains how to recover from this: https://graylog.org/post/what-to-do-when-you-have-1000-fields/

Unfortunately it is difficult to give helpful guidance, because every environment is different and the right approach is highly dependent on your log data.

In my experience there are only a handful of things that can contribute to such a large number of fields:

  • elastic beats
    • I strongly recommend putting each type of beat in its own index set
  • parsing JSON and setting the result as fields (taking EVERY JSON key and creating a corresponding message field; see the sketch after this list)
    • This one is less common but is something to watch out for.
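
As a quick illustration of that second point, flattening a single nested JSON payload creates one message field per leaf key, and across many different payload shapes this adds up fast (purely hypothetical data below):

import json

# Hypothetical JSON payload attached to one log message
payload = json.loads(
    '{"event": {"code": 4624, "provider": "Security"}, "user": {"name": "alice", "domain": "CORP"}}'
)

def flatten(obj, prefix=""):
    # Yield one field name per leaf key, the way "parse JSON into fields" setups do
    for key, value in obj.items():
        if isinstance(value, dict):
            yield from flatten(value, f"{prefix}{key}_")
        else:
            yield f"{prefix}{key}"

fields = list(flatten(payload))
print(len(fields), fields)
# 4 ['event_code', 'event_provider', 'user_name', 'user_domain']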

Even if you split out each beat type into its own stream/index set, that would still only be about 5-6 index sets, which isn’t a lot.

All our dashboards and all our alerts would have to be rewritten to search for logs across multiple indices.

Can you clarify what you mean by this? By default, all of Graylog’s functions (search, dashboards, alerts) work with ALL log data across all streams. Do you already have these items configured to use an explicit stream or streams?

This seems like a real scaling issue in Graylog, it doesn’t look like there is anything to help manage this issue

I understand your frustration. We do provide documentation as well as this forum. We’ve also raised the topic of index management with the product teams internally, so this is definitely something we’re aware of and want to improve.

This is easy in SQL, there must be a way to do something similar in ES?

Lucene is not designed as a relational database, so it functions differently. It does appear that Elasticsearch has added some metrics APIs to answer your query, but they don’t exist in a version of Elasticsearch that is compatible with Graylog.

Lastly, if you are interested, I put together a quick Python script to automate the following tasks to help you better answer your question:

  1. get a list of all indices
  2. get a list of all fields from each index
  3. count how many documents contain that field (using the _count API with an exists query).

See https://github.com/drewmiranda-gl/graylog-scripts/blob/main/Src/ES-OS-Count-Field-Usage/es-os-count-field-usage.py
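
If you just want the gist without reading the script, the core loop is roughly the following (a simplified sketch, not the script itself; it assumes unauthenticated access to Elasticsearch on localhost and only walks top-level fields):

import requests

ES = "http://localhost:9200"   # assumption: local Elasticsearch/OpenSearch node

# 1. get a list of all indices
indices = [row["index"] for row in
           requests.get(f"{ES}/_cat/indices", params={"format": "json"}).json()]

for index in indices:
    # 2. get a list of all fields from the index mapping (top-level fields only)
    mapping = requests.get(f"{ES}/{index}/_mapping").json()
    properties = mapping[index]["mappings"].get("properties", {})

    # 3. count how many documents contain each field (one _count query per field)
    for field in sorted(properties):
        count = requests.get(
            f"{ES}/{index}/_count",
            json={"query": {"exists": {"field": field}}},
        ).json()["count"]
        print(f"{index}\t{field}\t{count}")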

Example usage:

python3 es-os-count-field-usage.py --api-url http://localhost:9200

Example output: (screenshots omitted)

Note that this will send a large number of queries to your Elasticsearch cluster: one query for each field in each index.

Hope this helps.

Hi @drewmiranda-gl

I’ve read the blog post previously (I think it’s several years old now). It doesn’t really help; I know what the problem and the solution are, there are just no tools to help work out the most effective or efficient way of achieving that solution.

We already divide the logs up by Beat type/syslog etc., and we then sub-divide after that (for example, Active Directory logs are separated from Winlogbeat, Linux authentication logs are split out from Filebeat, etc.). We already have ~30 indices; we’d need to add another ~100 or so to split up the Windows Applications and Services Logs into separate indices.

We have a relatively mature Graylog environment, so splitting existing indices into new indices will require a reasonable amount of rework on our Dashboards and Pipelines (we configure these for specific streams to speed up query time and remove duplicate event query items; e.g. Event ID is not a unique ID across logs but is within a specific Stream).

Thanks for the Python script, I’ll give that a go. We log around 50GB a day and several thousand logs a second, so I’ll see how it affects the ES servers.

Graylog definitely needs some way of seeing into the indices to get some idea of what is going on; Elasticsearch/OpenSearch are pretty much a black box at the moment.

Hey @nick
A couple of ideas for you on looking into an index using some cURL commands.

Example:

List the mapping on an index:

curl -XGET http://localhost:9200/filebeat-2023.08.10/_mapping?pretty 

That will list all the fields on an index called filebeat-2023.08.10. In your case it might be GL_something-001.

Part of the Results:

{
  "filebeat-2023.08.10" : {
    "mappings" : {
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "@version" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "agent" : {
          "properties" : {
            "ephemeral_id" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "id" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "name" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 25

If you want everything, then this should work. NOTE: you might want to use a pipe “|” or send the output to a text file; it can be a long list.

curl -XGET http://localhost:9200/_mapping?pretty 

To get the mapping for a specific field, provide the field name (and optionally the index name):

curl -XGET http://localhost:9200/_mapping/field/<fields>
curl -XGET http://localhost:9200/<index>/_mapping/field/<fields>

NOTE: old fields will drop off when the end of the index retention is reached. You can avoid this by sending only the logs you want to Graylog, OR by creating a custom index template. OpenSearch/Elasticsearch default to dynamic mapping.
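
For example, a custom index template can stop dynamic field creation on matching indices. This is a rough sketch using the legacy ES 7.x _template API; the template name, index pattern and order are placeholders, so check the Graylog custom index mapping documentation before applying anything like this to a production index set:

import requests

ES = "http://localhost:9200"   # assumption: local Elasticsearch/OpenSearch node

# Legacy (ES 7.x) index template, applied to new indices matching the pattern.
# "dynamic": false stops unknown fields from being added to the mapping;
# the data is still stored in _source, it just isn't indexed/searchable.
template = {
    "index_patterns": ["winlogbeat_*"],   # placeholder pattern for the relevant index set
    "order": 10,                          # placeholder; must merge after Graylog's own template
    "mappings": {"dynamic": False},
}

resp = requests.put(f"{ES}/_template/custom-winlogbeat-mapping", json=template)
resp.raise_for_status()
print(resp.json())   # expect {'acknowledged': True}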
To sum it up, I never had to count fields, but I used the above commands to work out what I needed and adjusted my log shipper to send only that. Hope that helps.

Thanks @gsmith

I’ve already run all the above commands; they get me the list of fields (just like index/_field_caps?fields=*), but that doesn’t tell me how “popular” those fields are.

I see the fields dropping at the end of a 30-day index, but then within a day we’re back up at 1008 fields again.

I mean, ideally I’d love to know which log has the most fields.

Have you had any luck with the Python script? Also, any luck routing the different beats to different streams/indexes?

Hi @drewmiranda-gl, as I said in previous posts, we already route beats into different streams and indexes.

If you open up Event Viewer in Windows, open the Windows Applications and Services Logs folder, and click down through the various sub-folders, you’ll see our issue (or run Get-WinEvent in PowerShell). There are over 100 logs here that we need to ingest and process. At the moment we bring most of them into a single stream and index (not all; there are some critical ones we pull out). It’s a lot of effort to create over 100 streams and indexes and corresponding rules to split each and every one out, especially when some of them may only have 1 or 2 log entries.

This is the data we’re trying to get from Elasticsearch/Graylog: which event logs should we spend time splitting out, and which shouldn’t we? Neither Graylog nor Elasticsearch seems to have tools to help with this data analysis. Our logging system is the only place this data is centralised, so it’s the only place we could possibly get the answers.

Are you doing any additional parsing on the winlog data that is being ingested, such as adding additional fields? It’s unusual that winlogbeat would create more than 100-200 fields, and even that is a lot. I’m not sure how that could add up to 1000+ fields. How many of the fields are prepended with winlogbeat_? Do you have a field list you can provide?
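
In the meantime, one way to see which channels are the busiest, straight from Elasticsearch, is a terms aggregation on whatever field holds the Windows event log channel. The field name below is an assumption (something like winlogbeat_winlog_channel in a typical Graylog/Winlogbeat setup), so substitute whatever your field list shows. This gives entries per log rather than fields per log, but it is usually enough to decide which channels are worth splitting out first:

import requests

ES = "http://localhost:9200"                  # assumption: local Elasticsearch/OpenSearch node
INDEX = "graylog_*"                           # placeholder index pattern
CHANNEL_FIELD = "winlogbeat_winlog_channel"   # assumption: adjust to your schema

# Count documents per Windows event log channel so the noisiest logs
# can be prioritised for their own streams/index sets.
query = {
    "size": 0,
    "aggs": {"per_channel": {"terms": {"field": CHANNEL_FIELD, "size": 200}}},
}
resp = requests.post(f"{ES}/{INDEX}/_search", json=query)
resp.raise_for_status()

for bucket in resp.json()["aggregations"]["per_channel"]["buckets"]:
    print(f'{bucket["doc_count"]:>12}  {bucket["key"]}')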

Hi @drewmiranda-gl

No parsing; any logs that are parsed and modified are moved into a separate stream and index for queries/alerts etc.

Most of the fields are prepended with winlogbeat_; obviously there are a few Graylog ones, etc.

I’m not sure people realise that Windows Applications and Services Logs exist a lot of the time; there are a lot of logs in there. I’ll see if I can get a list.
