{"id":678,"date":"2024-07-10T12:22:04","date_gmt":"2024-07-10T16:22:04","guid":{"rendered":"https:\/\/bryanward.net\/wp\/?p=678"},"modified":"2024-07-10T12:22:05","modified_gmt":"2024-07-10T16:22:05","slug":"is-it-hot-enough-for-you-yet","status":"publish","type":"post","link":"https:\/\/bryanward.net\/wp\/2024\/07\/10\/is-it-hot-enough-for-you-yet\/","title":{"rendered":"Is it hot enough for you yet?"},"content":{"rendered":"\n<p>Today&#8217;s post concerns temperature thresholds for telecom facilities, and the rabbit hole I discovered when figuring out how to monitor temperature of Juniper equipment.<\/p>\n\n\n\n<p>If you read the datasheets for the Juniper <a href=\"https:\/\/www.juniper.net\/documentation\/us\/en\/hardware\/ex4300\/topics\/topic-map\/ex4300-site-guidelines-requirements.html#ex-series-environment__d22e33\" data-type=\"link\" data-id=\"https:\/\/www.juniper.net\/documentation\/us\/en\/hardware\/ex4300\/topics\/topic-map\/ex4300-site-guidelines-requirements.html#ex-series-environment__d22e33\">EX series<\/a> or QFX series (<a href=\"https:\/\/www.juniper.net\/documentation\/us\/en\/hardware\/qfx10002\/topics\/topic-map\/qfx10002-site-guidelines-requirements.html#qfx10002-environmental__d59e33\">10002<\/a>, <a href=\"https:\/\/www.juniper.net\/documentation\/us\/en\/hardware\/qfx5120\/topics\/topic-map\/qfx5120-site-guidelines-requirements.html\" data-type=\"link\" data-id=\"https:\/\/www.juniper.net\/documentation\/us\/en\/hardware\/qfx5120\/topics\/topic-map\/qfx5120-site-guidelines-requirements.html\">5120<\/a>), you&#8217;ll find that this equipment likes to operate in an environment &#8220;in temperature range of 32\u00b0F through 104\u00b0F (0\u00b0C through 40\u00b0C)&#8221;.  Some models push that as high as 45\u00b0C.  
These are actually pretty generous ranges, especially considering that the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) &#8211; an organization that publishes standards for the HVAC industry, much like the IEEE or BICSI does for us networking folks &#8211; says that <a href=\"https:\/\/xp20.ashrae.org\/datacom1_4th\/ReferenceCard.pdf\">telecom facilities should have a design target between 18 and 27\u00b0C<\/a>.  Knowing that that range is laughable for most Telecom Rooms (and even some datacenters), ASHRAE goes on to provide alternate thresholds based on the type of equipment.  Most networking equipment would fall under Class A3, which only requires a temperature range of 5\u00b0C to 40\u00b0C, or Class A4, which is 5\u00b0C to 45\u00b0C.  It&#8217;s almost like Juniper knew this when they designed their products and published their datasheets&#8230;<\/p>\n\n\n\n<p>The motivation to write this article came about at 5PM yesterday, and again at 6AM this morning, when my phone blew up with alarms about some of my production equipment exceeding their temperature warnings.<\/p>\n\n\n\n<p>The shame is entirely on me.  I had been adding some rules to our production monitoring system yesterday, and configured the temperature thresholds based on the values published in Juniper&#8217;s datasheets (40 and 45\u00b0C, depending on the model).  The problem is that the sensors inside the equipment do not measure ambient temperature &#8211; they&#8217;re located in the chassis in places that, even during normal operation in a properly cooled environment, will regularly exceed those limits.<\/p>\n\n\n\n<p>So, how then can you determine which thresholds are actually appropriate based on the equipment and the sensors within?  Let&#8217;s enter the rabbit hole&#8230;  Well, unsurprisingly, there&#8217;s a command for that.  
I had expected the threshold values to appear in the output of <code><a href=\"https:\/\/www.juniper.net\/documentation\/us\/en\/software\/junos\/cli-reference\/topics\/ref\/command\/show-chassis-environment.html\" data-type=\"link\" data-id=\"https:\/\/www.juniper.net\/documentation\/us\/en\/software\/junos\/cli-reference\/topics\/ref\/command\/show-chassis-environment.html\">show chassis environment<\/a><\/code>, but they do not &#8211; only the present operating values appear here, as such:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&gt; show chassis environment \nClass Item                           Status     Measurement\nPower FPC 0 Power Supply 0           OK        \n      FPC 0 Power Supply 1           OK        \nTemp  FPC 0 PSU Sensor               OK         38 degrees C \/ 100 degrees F\n      FPC 0 MGE-PHY Sensor           OK         42 degrees C \/ 107 degrees F\n      FPC 0 DC-DC Sensor             OK         46 degrees C \/ 114 degrees F\n      FPC 0 UM CONN Sensor           OK         36 degrees C \/ 96 degrees F\n      FPC 0 CPU Sensor               OK         41 degrees C \/ 105 degrees F\n      FPC 0 MAC Sensor               OK         45 degrees C \/ 113 degrees F\nFans  FPC 0 Fan Tray 0 Fan 0         OK         Spinning at normal speed\n      FPC 0 Fan Tray 0 Fan 1         OK         Spinning at normal speed\n      FPC 0 Fan Tray 1 Fan 0         OK         Spinning at normal speed\n      FPC 0 Fan Tray 1 Fan 1         OK         Spinning at normal speed\n\n\n&gt; show chassis environment fpc 0 \nFPC 0 status:\n  State                      Online\n  Temperature                             38 degrees C \/ 100 degrees F            \n<\/code><\/pre>\n\n\n\n<p>Notice how dangerously close these values are to the datasheet limits of 40 and\/or 45\u00b0C!  
No wonder my alarms all started to go off!<\/p>\n\n\n\n<p>It turns out, there&#8217;s another command I should have looked at: <code><a href=\"https:\/\/www.juniper.net\/documentation\/us\/en\/software\/junos\/cli-reference\/topics\/ref\/command\/show-chassis-temperature-thresholds.html\" data-type=\"link\" data-id=\"https:\/\/www.juniper.net\/documentation\/en_US\/junos12.3\/topics\/reference\/command-summary\/show-chassis-temperature-thresholds-command.html\">show chassis temperature-thresholds<\/a><\/code><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&gt; show chassis temperature-thresholds \n                                        Fan speed      Yellow alarm      Red alarm      Fire Shutdown\n                                       (degrees C)      (degrees C)     (degrees C)      (degrees C)\nItem                                  Normal  High   Normal  Bad fan   Normal  Bad fan     Normal\nFPC 0 PSU Sensor                          40    61       64       64       68       68         73\nFPC 0 MGE-PHY Sensor                      48    68       71       71       75       75         80\nFPC 0 DC-DC Sensor                        45    66       69       69       73       73         78\nFPC 0 UM CONN Sensor                      44    64       67       67       71       71         76\nFPC 0 CPU Sensor                          47    67       70       70       74       74         80\nFPC 0 MAC Sensor                          50    71       73       73       76       76         80\n<\/code><\/pre>\n\n\n\n<p>That&#8217;s much more reasonable!  
The limits there are clearly displayed for each sensor, including several tiers of alarm.<\/p>\n\n\n\n<p>There is an <a href=\"https:\/\/community.juniper.net\/discussion\/changing-temperature-thresholds-on-ex-switch\" data-type=\"link\" data-id=\"https:\/\/community.juniper.net\/discussion\/changing-temperature-thresholds-on-ex-switch\">old post from Steve Puluka<\/a> that clued me in to the presence of a hidden command to change these built-in alarm thresholds; however, it&#8217;s best not to use it.  These thresholds have been factory-set based on the capabilities of the equipment.  If you&#8217;re curious though, here it is:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code># set chassis temperature-threshold ?\nPossible completions:\n+ apply-groups         Groups from which to inherit configuration data\n+ apply-groups-except  Don't inherit configuration data from these groups\n  fire-shutdown        Threshold at which router will be shutdown within 10 seconds (degrees C)\n  red-alarm            Threshold at which red alarm is set (degrees C)\n  red-alarm-if-failed-fan  Threshold at which red alarm is set when bad fan present (degrees C)\n  yellow-alarm         Threshold at which yellow alarm is set (degrees C)\n  yellow-alarm-if-failed-fan  Threshold at which yellow alarm is set when bad fan present (degrees C)<\/code><\/pre>\n\n\n\n<p>If you actually leverage the built-in &#8220;alarm&#8221; status of your equipment, and you&#8217;re operating in extreme environments, you may find that you want to adjust these.  The thread that I linked to goes on to say that this command only affects logging and not SNMP traps, and there&#8217;s no guarantee that it will properly trigger the built-in &#8220;alarm&#8221; status either.  So, YMMV, and again my recommendation is to not mess with this.  
It&#8217;s better to fix the cooling situation in your facility than to change these thresholds.<\/p>\n\n\n\n<p>Alright, what if you&#8217;re not using the built-in alarm status, and you want to query each sensor directly in your NMS?  Unsurprisingly, there are <a href=\"https:\/\/supportportal.juniper.net\/s\/article\/EX-How-to-check-temperature-CPU-memory-usage-by-SNMP-OID?language=en_US\">SNMP OIDs to do just that<\/a>.  The easiest one to use is <code>jnxOperatingTemp<\/code>, or <code>1.3.6.1.4.1.2636.3.1.13.1.7<\/code>.  This corresponds to the output of <code>show chassis environment fpc<\/code>.<\/p>\n\n\n\n<p>One more note.  You should be monitoring the FPCs and not the REs.  If you monitor just the REs, then you will not get the status of any line-cards in a chassis, only the master and backup RE, and could miss events occurring on just the line-cards.  To put this in SNMP terms:<br><code>1.3.6.1.4.1.2636.3.1.13.1.7.7.<strong>1<\/strong>.0.0<\/code> is <code>show chassis environment fpc 0<\/code>  <br><code>1.3.6.1.4.1.2636.3.1.13.1.7.7.<strong>2<\/strong>.0.0<\/code> is <code>show chassis environment fpc 1<\/code> <br><code>1.3.6.1.4.1.2636.3.1.13.1.7.7.<strong>3<\/strong>.0.0<\/code> is <code>show chassis environment fpc 2<\/code> <br>etc.<\/p>\n\n\n\n<p><code>1.3.6.1.4.1.2636.3.1.13.1.7.<strong>9.1<\/strong>.0.0<\/code> is <code>show chassis environment re 0<\/code> and <code>1.3.6.1.4.1.2636.3.1.13.1.7.<strong>9.2<\/strong>.0.0<\/code> is <code>show chassis environment re 1<\/code>.<\/p>\n\n\n\n<p>If you&#8217;re using the <a href=\"https:\/\/datatracker.ietf.org\/doc\/html\/rfc6241\">RFC 6241 NETCONF<\/a> API to do your monitoring (which is worth looking into), you can use the following RPC to get the same data:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&lt;get-environment-fpc-information&gt;\n    &lt;fpc-slot&gt;0&lt;\/fpc-slot&gt;\n&lt;\/get-environment-fpc-information&gt;<\/code><\/pre>\n\n\n\n<p>I had hoped to stop writing at this point, but 
the thresholds for this measurement point are NOT in the output of <code>show chassis temperature-thresholds<\/code> !  So, what temperatures should be cause for concern when looking at this sensor?<\/p>\n\n\n\n<p>Well, it seems there&#8217;s a command for that too, but it doesn&#8217;t seem to work on any platform I&#8217;ve tried!<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>&gt; show chassis temperature-thresholds fpc 0     \nerror: command is not valid on the ex3400-48p\n\n&gt; show chassis temperature-thresholds fpc 0 \nerror: command is not valid on the ex4300-48mp\n\n&gt; show chassis temperature-thresholds fpc 0 \nerror: command is not valid on the qfx10002-72q<\/code><\/pre>\n\n\n\n<p>I also tried walking the entire SNMP MIB tree to see if there&#8217;s any OID or value that makes sense for an upper limit to jnxOperatingTemp.  Long story short, there aren&#8217;t any OIDs that specify the thresholds of jnxOperatingTemp.<\/p>\n\n\n\n<p>Juniper&#8217;s <a href=\"https:\/\/www.juniper.net\/documentation\/us\/en\/software\/junos\/network-mgmt\/topics\/topic-map\/monitoring-network-service-quality-by-using-rmon.html#id-measuring-health__d130e23\">RMON documentation<\/a> for this measurement only says &#8220;Allowable range: To be baselined&#8221;, which isn&#8217;t very useful either.<\/p>\n\n\n\n<p>Juniper&#8217;s HealthBot actually does make an <a href=\"https:\/\/github.com\/Juniper\/healthbot-rules\/blob\/master\/juniper_official\/Chassis\/README.md#rule-name-check-chassis-temperature\">attempt at defining a threshold<\/a> though.  It will trigger a yellow alert if the temperature rises above 45\u00b0C, and a red alert if it rises above 55\u00b0C.  
HealthBot&#8217;s logic does not take the hardware model into account, and it&#8217;s not entirely clear which measurement it is looking at.<\/p>\n\n\n\n<p>Some other engineers like <a href=\"https:\/\/lkhill.com\/mx-upgrade-overheat\/\">Lindsay<\/a> and <a href=\"https:\/\/51sec.weebly.com\/blog\/using-snmpv3-monitoring-juniper-srx-240h-alarm-andtemperature\">Jon<\/a> have published their own articles detailing their woes of troubleshooting Juniper temperature issues.  They make reference to Junos logging alarms when the temperature reaches 50\u00b0C.  Lindsay&#8217;s article also reveals the scary fact that Junos will shut down some types of SFP modules at this 50\u00b0C limit as well.  This shutdown behavior is related to the CHASSISD_FPC_OPTICS_HOT_NOTICE log message.  Other similar mnemonics include CHASSISD_TEMP_HOT_NOTICE, CHASSISD_FRU_HIGH_TEMP_CONDITION, and the dreaded CHASSISD_OVER_TEMP_SHUTDOWN_TIME at which the entire router\/switch will shut down.<\/p>\n\n\n\n<p>So, if Junos shuts things down at 50\u00b0C, I think HealthBot&#8217;s threshold of 55\u00b0C is far too high.  On the other end, as I experienced in my environment, a facility ambient temperature of 19\u00b0C (66\u00b0F) &#8211; quite chilly &#8211; still had EX switches reporting an FPC temperature of over 40\u00b0C.  In another much warmer facility, I have a QFX10k2 router reporting an FPC operating temperature of 55\u00b0C, and no SFPs have been shut down yet.  Since I&#8217;ve yet to find an authoritative answer on what the limit for jnxOperatingTemp should be, I would think having my NMS alert at somewhere between 45 and 49\u00b0C is a safe starting point.<\/p>\n\n\n\n<p>But I wanted to take a more precise approach.  I decided that I wouldn&#8217;t alert on jnxOperatingTemp, but would instead alert on the FPC CPU Temp, which does have defined alarm thresholds as discussed earlier.  One problem though &#8211; there is no SNMP OID for FPC CPU Temp!  
It is available via NETCONF at <code>fpc-information\/fpc\/temperature<\/code> though&#8230;<\/p>\n\n\n\n<p>Aha!  That generic use of &#8220;fpc\/temperature&#8221; led me to discover that what SNMP calls jnxOperatingTemp is actually the same as FPC CPU Temp for the EX3400 series, which has defined thresholds as seen in the output of <code>show chassis temperature-thresholds<\/code>!<\/p>\n\n\n\n<p>So, there are, in fact, defined thresholds for jnxOperatingTemp &#8211; they&#8217;re just displayed as something else!  Doh!<\/p>\n\n\n\n<p>What are they displayed as then? It gets worse, as that depends on the model, and sadly here the logic could use some work. Rather than consistently using the CPU Temp regardless of platform, it just picks the first temperature sensor listed in the output of <code>show chassis environment<\/code> to use as the value for the generic FPC Temperature measurement. What that means is that the NETCONF value for fpc\/temperature is the following:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td><strong>Platform<\/strong><\/td><td><strong>fpc\/temperature Measurement Source<\/strong><\/td><td><strong>Yellow Alarm<\/strong><\/td><td><strong>SNMP OID<\/strong><\/td><\/tr><tr><td>EX2300C<\/td><td>FPC CPU Sensor<\/td><td>65<\/td><td>1.3.6.1.4.1.2636.3.1.13.1.7.x.0.0<\/td><\/tr><tr><td>EX3400<\/td><td>FPC CPU Temp<\/td><td>63<\/td><td>1.3.6.1.4.1.2636.3.1.13.1.7.x.0.0<\/td><\/tr><tr><td>EX4300MP<\/td><td>FPC PSU Sensor<\/td><td>64<\/td><td>1.3.6.1.4.1.2636.3.1.13.1.7.x.0.0<\/td><\/tr><tr><td>EX4400<\/td><td>FPC Main Board Sensor 1<\/td><td>69<\/td><td>1.3.6.1.4.1.2636.3.1.13.1.<strong>9<\/strong>.x.0.0<br>(.<strong>7<\/strong>.x.0.0 appears to be the CPU Die Temp Sensor)<\/td><\/tr><tr><td>QFX5110<\/td><td>FPC Sensor TopLeft I<\/td><td>69<\/td><td>1.3.6.1.4.1.2636.3.1.13.1.<strong>9<\/strong>.x.0.0<br>(I cannot determine where .7.x.0.0 gets its value from.)<\/td><\/tr><tr><td>QFX10002<\/td><td>FPC Intake Temp 
Sensor<\/td><td>65<\/td><td>1.3.6.1.4.1.2636.3.1.13.1.<strong>9<\/strong>.x.0.0<br>(.<strong>7<\/strong>.x.0.0 appears to be the CPU Die Temp Sensor)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>You&#8217;ll see that these don&#8217;t line up nicely with the SNMP OIDs either. SNMP OID <code>1.3.6.1.4.1.2636.3.1.13.1.7.7<\/code> on the EX3400 platform is the same as the reported FPC CPU Temp. The same SNMP OID on the QFX10k2 platform is reported as the FPC CPU Die Temp, but <code>1.3.6.1.4.1.2636.3.1.13.1.7.9<\/code> on the QFX10k2 is the same as FPC Intake Temp. That .9 OID refers to the Routing Engine, which on a QFX10k2 I guess makes sense, as you can&#8217;t connect QFX10k2s into a virtual chassis&#8230; But you can connect EX4400s and QFX5ks into virtual chassis&#8230;<\/p>\n\n\n\n<p>To recap: the EX3400 FPC CPU Temp and the QFX10k2 FPC Intake Temp are the values as shown in the command <code>show chassis environment fpc<\/code>. This same value is also shown in the output of the command <code>show chassis fpc<\/code>.  This same measurement is also available via NETCONF RPC <code>get-fpc-information<\/code> and returns the value in <code>fpc-information\/fpc\/temperature<\/code>.  The command <code>show chassis temperature-thresholds<\/code> lists the alarm limits for these values.<\/p>\n\n\n\n<p>Since the alarm limits do vary by platform (and possibly by HW revision or FW version), I suggest you set your NMS to alarm based on the output of <code>show chassis temperature-thresholds<\/code> for your specific equipment, and make sure that it&#8217;s monitoring the correct SNMP OID or NETCONF RPC. After combing through all the equipment in my environment, 63\u00b0C (145\u00b0F) of <code>fpc-information\/fpc\/temperature<\/code> seems like a very reasonable value at which I should be alerted and taking action. 
Some equipment has higher alarm limits, but I wouldn&#8217;t want anything to be running hotter than that anyway.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today&#8217;s post concerns temperature thresholds for telecom facilities, and the rabbit hole I discovered when figuring out how to monitor temperature of Juniper equipment.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[9,10,42,43],"class_list":["post-678","post","type-post","status-publish","format-standard","hentry","category-uncategorized","tag-juniper","tag-junos","tag-monitoring","tag-temperature"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/bryanward.net\/wp\/wp-json\/wp\/v2\/posts\/678","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/bryanward.net\/wp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/bryanward.net\/wp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/bryanward.net\/wp\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/bryanward.net\/wp\/wp-json\/wp\/v2\/comments?post=678"}],"version-history":[{"count":14,"href":"https:\/\/bryanward.net\/wp\/wp-json\/wp\/v2\/posts\/678\/revisions"}],"predecessor-version":[{"id":692,"href":"https:\/\/bryanward.net\/wp\/wp-json\/wp\/v2\/posts\/678\/revisions\/692"}],"wp:attachment":[{"href":"https:\/\/bryanward.net\/wp\/wp-json\/wp\/v2\/media?parent=678"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/bryanward.net\/wp\/wp-json\/wp\/v2\/categories?post=678"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/bryanward.net\/wp\/wp-json\/wp\/v2\/tags?post=678"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}