Broken UNIXPATH grok pattern?

Some file names in my FTP log parsed by Graylog contain annoying German characters like ä, ö, ü. It seems the grok pattern “UNIXPATH” does not like them. A file name “/home/my/blümchen.txt” is recognized as “/home/my/bl”, which causes some other problems. My brilliant idea was to fix the UNIXPATH pattern to work “better”. I tried the Logstash-style pattern (/[[[:alnum:]]_%!$@:.,+~-]*)+, which did the job when tested in the grok tester (https://grokdebug.herokuapp.com): my “blümchen.txt” was recognized as expected.

But it seems this pattern won’t work in Graylog. Am I right that [:alnum:] is not recognized by Graylog? Is there any other way to “fix” the UNIXPATH pattern? Or is my whole approach to this problem wrong?

CentOS 7 / Graylog 3.3.7; UTF-8 in the locale seems to be set correctly.

Graylog uses Java regex, so maybe this will work (I haven’t tried it):
(/[\p{Alnum}_%!$@:.,+~-]*)+
https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Thank you. I shall remember this. However… no. Changing \w to \p{Alnum} still gave me the same wrong result.

After playing with the pattern, this one, (/[\w_%!$@:.,+~-[^ ]])+, worked fine; (/[.[^ ]])+ would do the job as well. But it matches ANYTHING except a space. Even though it works with the sample data I provided, I feel it is somehow wrong to make it so “loose”.

Why is there such a difference in \p{Alnum} or [:alnum:] between different systems? Are any mysterious system settings involved here? Throughout the whole parsing/storage process, all such “unusual” characters (French and Polish as well) are displayed correctly, so I assume the system locale, encoding and so on are set correctly in my environment.
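
One likely explanation, offered as a sketch rather than a confirmed diagnosis: in Java's java.util.regex, \w and the POSIX classes such as \p{Alnum} are US-ASCII only by default, so the difference would come from the regex engine and not from the locale. Java only widens them to Unicode when the UNICODE_CHARACTER_CLASS flag is set (inline form (?U)), while Unicode categories such as \p{L} and \p{N} work without any flag. The small program below compares the three variants on the sample path from this thread. The class name and helper are made up for illustration, and whether Graylog passes an inline flag like (?U) through to the Java engine unchanged is an assumption I have not verified.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Graylog or of any grok library.
class UnixPathDemo {
    // Sample path from the thread; the u-umlaut is written as an escape so the
    // source file encoding does not matter.
    static final String PATH = "/home/my/bl\u00fcmchen.txt";

    // \p{Alnum} as suggested in the thread: ASCII-only by default in Java.
    static final String ASCII_ALNUM = "(/[\\p{Alnum}_%!$@:.,+~-]*)+";

    // Same pattern with the inline (?U) flag (UNICODE_CHARACTER_CLASS),
    // which makes \p{Alnum} cover Unicode letters and digits.
    static final String UNICODE_FLAG = "(?U)" + ASCII_ALNUM;

    // Flag-free alternative: Unicode letters (\p{L}) and numbers (\p{N}).
    static final String UNICODE_PROPS = "(/[\\p{L}\\p{N}_%!$@:.,+~-]*)+";

    // Returns the part of input matched from its start, or "" if nothing matches.
    static String matchPrefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : "";
    }

    public static void main(String[] args) {
        // Stops before the umlaut: /home/my/bl
        System.out.println(matchPrefix(ASCII_ALNUM, PATH));
        // Both of these match the whole path.
        System.out.println(matchPrefix(UNICODE_FLAG, PATH));
        System.out.println(matchPrefix(UNICODE_PROPS, PATH));
    }
}
```

If the behaviour in Graylog matches plain Java, replacing \p{Alnum} with \p{L}\p{N} in a custom pattern would keep it restrictive while still accepting ä, ö, ü and similar letters. Note that this only covers precomposed characters; a file name stored with combining marks would additionally need \p{M} in the class.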
