GROK extractor field data types and how they relate to ElasticSearch field data types


#1

I submitted this question a few weeks ago and got no response, and the thread was auto-closed (I don’t really understand why that’s useful, but it’s not my board :)

Hi, I’m seeing a discrepancy between the field data type I set in my grok extractor and the type the field ends up with when the index is created in Elasticsearch. I do not have any custom mappings for this index.

When I set “int” as the field data type in my grok extractor, that appears to be ignored, and the field gets created in the index as type “long”. In another case (not the extractor below) I set it to “int”, and the field was created in the Elasticsearch index as type “keyword” (yes, I rotated the index after making the changes).

My question is, if I use one of the field types listed in this doc, how does that relate to how the index field types are configured in elasticsearch? It seems like there is a “loose” correlation, but not a strict one. The “float” data type seems to be passed through from grok extractor to elasticsearch index.

I know I can configure a custom mapping, but I’d prefer to use those as little as possible, and instead put the data type in the grok so my users can create groks without my assistance making changes directly to elasticsearch.

Here is the grok extractor in question. The fields that come out inconsistent in this case are “mysql_rows_sent” and “mysql_rows_examined”.

^# User@Host: %{NOTSPACE:mysql_user} @ %{HOSTNAME:mysql_hostname}%{GREEDYDATA}\n# Query_time: %{BASE16FLOAT:mysql_query_time;float}\s+Lock_time: %{BASE16FLOAT:mysql_lock_time;float}\s+Rows_sent: %{BASE10NUM:mysql_rows_sent;int}\s+Rows_examined:\s+%{BASE10NUM:mysql_rows_examined;int}\nSET timestamp=%{BASE10NUM:timestamp};\n%{GREEDYDATA:mysql_query;string}

    05/16 17:45[root@admin3]# curl -XGET localhost:9200/graylog_19/_mapping/field/mysql_*?pretty
    {
      "graylog_19" : {
        "mappings" : {
          "message" : {
            "mysql_lock_time" : {
              "full_name" : "mysql_lock_time",
              "mapping" : {
                "mysql_lock_time" : {
                  "type" : "float"
                }
              }
            },
            "mysql_query_time" : {
              "full_name" : "mysql_query_time",
              "mapping" : {
                "mysql_query_time" : {
                  "type" : "float"
                }
              }
            },
            "mysql_rows_sent" : {
              "full_name" : "mysql_rows_sent",
              "mapping" : {
                "mysql_rows_sent" : {
                  "type" : "long"
                }
              }
            },
            "mysql_rows_examined" : {
              "full_name" : "mysql_rows_examined",
              "mapping" : {
                "mysql_rows_examined" : {
                  "type" : "long"
                }
              }
            },
            "mysql_user" : {
              "full_name" : "mysql_user",
              "mapping" : {
                "mysql_user" : {
                  "type" : "keyword"
                }
              }
            },
            "mysql_query" : {
              "full_name" : "mysql_query",
              "mapping" : {
                "mysql_query" : {
                  "type" : "keyword"
                }
              }
            },
            "mysql_hostname" : {
              "full_name" : "mysql_hostname",
              "mapping" : {
                "mysql_hostname" : {
                  "type" : "keyword"
                }
              }
            }
          }
        }
      }
    }

(Jochen) #2

It depends. If you don’t have any custom index mappings, Elasticsearch will try to guess the type of each field when it’s created.

If your grok patterns are the only source of input, then the data type you’ve provided in your grok patterns will be used.
If other messages with a different data type for a given field are indexed into Elasticsearch first, then their data type will be used for that message field.
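
As a rough illustration of the guessing described above, here is a minimal Python sketch. It is not Elasticsearch's actual code. It assumes that JSON integers are mapped to "long", JSON floating point numbers to "float", and strings to "keyword" (which matches the mapping output earlier in this thread), and that the first value indexed for a field fixes its type for the life of the index. The names `guess_es_type` and `build_mapping` are made up for this example.

```python
# Rough model of dynamic field mapping as observed in this thread.
# Assumption: strings end up as "keyword"; this is an illustration only.

def guess_es_type(value):
    """Return the field type dynamic mapping would likely pick for a JSON value."""
    if isinstance(value, bool):   # bool is a subclass of int, so check it first
        return "boolean"
    if isinstance(value, int):    # any JSON integer is mapped to "long"
        return "long"
    if isinstance(value, float):
        return "float"
    return "keyword"


def build_mapping(messages):
    """The first value seen for a field decides its type; later values do not change it."""
    mapping = {}
    for message in messages:
        for field, value in message.items():
            mapping.setdefault(field, guess_es_type(value))
    return mapping


messages = [
    {"mysql_rows_sent": 12, "mysql_query_time": 0.25},
    {"mysql_rows_sent": "12", "mysql_user": "root"},
]
print(build_mapping(messages))
```

In this model a field converted with `;int` arrives as a JSON integer and is mapped as "long", while a field whose first indexed value was still a string is mapped as "keyword" until the index is rotated.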

tl;dr: Create custom index mappings if you want to make sure that certain message fields always have a well-defined data type.
http://docs.graylog.org/en/2.4/pages/configuration/elasticsearch.html#custom-index-mappings
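
For reference, a custom index mapping for the two fields from the question could be generated along these lines. This is a sketch under assumptions, not a tested configuration: it follows the template layout from the documentation linked above (a top-level "template" index pattern and a "message" mapping type, as used with Elasticsearch 5.x; newer Elasticsearch versions use "index_patterns" instead), and `build_template` is a hypothetical helper name.

```python
import json


def build_template(field_types, index_pattern="graylog_*"):
    """Build an index template body that pins explicit types for the given fields."""
    properties = {name: {"type": es_type} for name, es_type in field_types.items()}
    return {
        "template": index_pattern,  # assumption: Elasticsearch 5.x style key
        "mappings": {"message": {"properties": properties}},
    }


template = build_template(
    {"mysql_rows_sent": "integer", "mysql_rows_examined": "integer"}
)
print(json.dumps(template, indent=2))
```

The printed JSON would then be PUT to the Elasticsearch `_template` endpoint under a name of your choosing, and it only takes effect for indices created afterwards, so the active index has to be rotated.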


#3

Thanks Jochen,

So if I’m specifying a field type in the GROK pattern, and that field name hasn’t previously existed with another field type (or I’ve rotated my index since then), but the field type is NOT getting set in elasticsearch, that would be a graylog bug, correct? If so, I’ll file a bug report.


(Jochen) #4

Please create a bug report at https://github.com/Graylog2/graylog2-server/issues and provide all the information necessary to reproduce the issue (such as the complete Grok pattern and the patterns it depends on, some example messages, the configuration of the Grok extractor or the pipeline rule, and the Elasticsearch index mappings and templates).

