Monday, August 22, 2022

YARN/Spark log aggregation - log4j, RootLogger, Appenders, yarn.log-aggregation-enable


Reference: https://medium.com/@iacomini.riccardo/spark-logging-configuration-in-yarn-faf5ba5fdb01

In Cloudera clusters, yarn.log-aggregation-enable is enabled by default.

yarn-site.xml

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
  <value>3600</value>
</property>
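
Once aggregation runs, the per-container logs are collected into HDFS and can be fetched with the standard YARN CLI (the application id is a placeholder):

yarn logs -applicationId <application_id>

What lands in those aggregated files is whatever the log4j root logger's appenders emit. A minimal log4j.properties sketch for Spark on YARN, assuming a RollingFileAppender writing under the container log dir; the file name and rolling thresholds here are illustrative, not from the post (${spark.yarn.app.container.log.dir} is the container log directory Spark exposes on YARN):

log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

Ship the file with spark-submit --files log4j.properties and point both spark.driver.extraJavaOptions and spark.executor.extraJavaOptions at it with -Dlog4j.configuration=log4j.properties, as the referenced article describes.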


Friday, August 19, 2022

KafkaProducer and Spark Streaming - things to remember - auth keystore load & file ulimit - "too many open files" error


When a Spark Streaming or batch process writes to a Kafka topic:

Be aware of the authentication performed by the org.apache.kafka.clients.producer.KafkaProducer send/doSend path, which loads the keystore/keytab for authentication for each message it publishes.

This means the keytab is read from the filesystem over and over: as the message count increases, the number of open file handles grows and can exceed the ULIMIT (the max number of open file handles), throwing a "too many open files" error and failing the job.

Make sure to publish the RDD in a distributed way - see the sketch below.
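
Below is a minimal Scala sketch of that pattern, assuming string keys/values; the publishToKafka helper, broker list, and topic name are illustrative, not from the original note. The key point is that the KafkaProducer is constructed once per partition inside foreachPartition, so the keystore/keytab behind its security config is read once per partition instead of once per record, keeping the open-file count bounded.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.rdd.RDD

// Illustrative helper: publish an RDD of strings to a Kafka topic,
// creating one producer per partition rather than one per record.
def publishToKafka(rdd: RDD[String], topic: String, brokers: String): Unit = {
  rdd.foreachPartition { records =>
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // Security settings (ssl.keystore.location, SASL/JAAS keytab, ...) go here;
    // they are read when the producer is constructed, which is exactly why the
    // producer must not be created inside the per-record loop.
    val producer = new KafkaProducer[String, String](props)
    try {
      records.foreach(msg => producer.send(new ProducerRecord[String, String](topic, msg)))
    } finally {
      producer.close() // flush buffered sends and release the file handles
    }
  }
}

If several partitions land on the same executor, the producer can also be cached in a lazy singleton object so the whole JVM shares one instance.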