FYI, This Graylog is deployed with Helm (https://github.com/helm/charts/tree/master/stable/graylog) in a performance test environment (1.8TB logged yesterday)
Increasing Graylog heap size (form 7G to 12GB, and then to 18GB) seems fixed the issue.
This change with output_batch_size change from default 500 to 12,000 also fixed the Process/Output buffer 100% issue (I tried doubling the resource of ElasticSearch first, no significant improvement. The ElasticSearch data nodes are under utilized).
Current load is low. I will continue observing the load/performance and update if there is any change.