Uploaded image for project: 'JDK'
  1. JDK
  2. JDK-8072587

ScheduledThreadPoolExecutor stalled

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: P4
    • Resolution: Cannot Reproduce
    • Affects Version/s: 7u67
    • Fix Version/s: None
    • Component/s: core-libs
    • Labels:

      Description

      FULL PRODUCT VERSION :
      /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.45.x86_64/jre (OpenJDK 64-Bit Server VM, 24.45-b08)

      but not fixed in 7u67.

      ADDITIONAL OS VERSION INFORMATION :
      Red Hat Enterprise Linux Server release 6.2 (Santiago) 2.6.32-220.7.1.el6.x86_64

      A DESCRIPTION OF THE PROBLEM :
      A customer (ref: ZD-23744) using Jenkins reported that sometimes after the server is running for a while it stops scheduling builds. This is normally handled by a task (hudson.triggers.Trigger.Cron) recurring every one minute, which is scheduled using scheduleAtFixedRate on jenkins.util.Timer.get() (a ScheduledThreadPoolExecutor with corePoolSize=10 and a thread factory customize Thread.name and daemon status). Was initially using 7u45; problem persisted in 7u67.

      Analysis using the Groovy script console showed that none of the dozen or more tasks in the system had run for over a day (besides the once-per-minute Cron there are various others with regular or irregular schedules). The DelayedWorkQueue had the expected ScheduledFutureTask’s, but many of them had a ‘time’ well before the current system time. For example, in a normal Jenkins instance the diagnostic script

      println("${new Date()}\tnow");
      long baseline = System.currentTimeMillis() * 1000 * 1000 - System.nanoTime();
      def f = java.util.concurrent.FutureTask.class.getDeclaredField('callable');
      f.accessible = true;
      jenkins.util.Timer.get().queue.each {t ->
        def callable = f.get(t);
        try {
          callable = callable.task;
        } catch (MissingPropertyException _) {}
        println("${new Date((long) (baseline + t.time) / 1000 / 1000)}\t${t.period / 1000 / 1000 / 1000}s\t${callable}");
      }; null;

      would show among other entries

      Thu Jan 29 15:39:07 EST 2015 now
      Thu Jan 29 15:40:00 EST 2015 60s hudson.triggers.Trigger$Cron@403aaa7

      with Cron always being scheduled for the upcoming round minute. In the problematic instance, Cron was scheduled for some random-seeming round minute hours or even days before.

      Nothing related to recurrent tasks was happening in the thread dump; all the pool threads were waiting in the usual way:

      "jenkins.util.Timer [#4]" #… daemon prio=5 os_prio=0 tid=0x… nid=0x… waiting on condition [0x…]
         java.lang.Thread.State: WAITING (parking)
          at sun.misc.Unsafe.park(Native Method)
          - parking to wait for <0x…> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
          at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1088)
          at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
          at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)

      The toString version of the service looked normal enough, and showed that it had really been running for quite some time until now:

      java.util.concurrent.ScheduledThreadPoolExecutor@…[Running, pool size = 10, active threads = 0, queued tasks = 40, completed tasks = 449890]

      The following script, designed to show what is next up in the queue

      def f = java.util.concurrent.FutureTask.class.getDeclaredField('callable');
      f.accessible = true;
      def x = f.get(jenkins.util.Timer.get().queue.poll(5, java.util.concurrent.TimeUnit.SECONDS));
      println(x);
      println(x.task);

      also printed normal-looking output:

      java.util.concurrent.Executors$RunnableAdapter@…
      com.cloudbees.jenkins.plugins.nodesplus.OwnerNodeProperty$PeriodicWorkImpl@…

      (where OwnerNodeProperty.PeriodicWorkImpl is another fairly frequent recurring task).

      The problem comes when you look at the delay:

      println(jenkins.util.Timer.get().queue.queue[0].getDelay(java.util.concurrent.TimeUnit.NANOSECONDS) / 1000.0 / 1000 / 1000);

      Normally the result is a positive number less than 1. But on this instance the result was -68507.216428714: the head of the queue was supposed to have been run long ago.

      Now it gets interesting. I had the customer run this script:

      def q = jenkins.util.Timer.get().queue;
      q.lock.lockInterruptibly();
      try {
        q.available.signal();
      } finally {
        q.lock.unlock();
      }

      which sends a fresh signal to DelayedWorkQueue.available. Suddenly the whole system returned to normal and resumed scheduling builds every minute and so on. Indeed if the following script is run at system startup time:

      import hudson.triggers.*;
      // Trigger.timer is a deprecated java.util.Timer, to avoid using java.util.concurrent for the supervisor:
      Trigger.timer.scheduleAtFixedRate(new SafeTimerTask() {
        protected void doRun() {
          if (jenkins.model.Jenkins.instance.getExtensionList(Trigger.Cron)[0].cal.timeInMillis - System.currentTimeMillis() < -120000) {
            // Cron has not run recently. Log warning, then…
            def q = jenkins.util.Timer.get().queue;
            q.lock.lockInterruptibly();
            try {
              q.available.signal();
            } finally {
              q.lock.unlock();
            }
          }
        }
      }, 60000, 60000);

      then the symptom does not recur.

      My guess here is that on occasion, due to factors unknown, available.signal() is called but the notification is dropped: nothing happens to be waiting at that moment and available to respond. At that point the whole service simply stalls waiting for a signal which will never come.

      STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
      Not known to reproduce, unfortunately.


      ERROR MESSAGES/STACK TRACES THAT OCCUR :
      No relevant errors seen.

      REPRODUCIBILITY :
      This bug can be reproduced occasionally.

        Attachments

          Activity

            People

            Assignee:
            martin Martin Buchholz
            Reporter:
            webbuggrp Webbug Group
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved: