Thursday, March 2, 2017

Why did my nested Mistral workflow execution fail?

How to find the task execution that caused my mistral workflow execution to fail?

Good question! The answer is in this post (well, probably). Don't be alarmed by the "TODO"s in the text.

The Problem 

For a CLI/API user, before OpenStack Ocata there was no short way to find the task that caused a workflow execution to fail. Ocata adds a feature that returns all the failed tasks of a workflow execution recursively (for now you still need to execute a workflow to calculate the result).

I'm going to work through examples in this post. If you want the shorter version, go to the Mistral documentation or to the spec where this feature was designed. TODO add links.

Example workflow definition structure

Let's take the following workflow definition structure as an example: a workflow containing only one task of type workflow. It looks simple, but the nested workflow can also contain just one task of type workflow, and so on until we are out of memory. So in my examples I am going to use just 5 levels of workflows.

This is the example structure (yes, all pictures in this post were made using paint):



Example workflow definition 1

To make it clearer, here is a possible workflow with the same structure (one of many):
---
version: "2.0"
name: all_fail_wb
workflows:
  main_wf:
    tasks:
      task1:
        workflow: sub_wf_1_of_main_wf
  sub_wf_1_of_main_wf:
    tasks:
      task2:
        workflow: sub_wf_2_of_sub_wf_1
  sub_wf_2_of_sub_wf_1:
    tasks:
      task3:
        workflow: sub_wf_3_of_sub_wf_2
  sub_wf_3_of_sub_wf_2:
    tasks:
      task4:
        workflow: sub_wf_4_of_sub_wf_3
  sub_wf_4_of_sub_wf_3:
    tasks:
      task5:
        action: std.fail


If the workflow above is executed, it will fail, and all task executions and action executions in it will also fail.



A user working only with the CLI get and list commands will need to call task-list with the main wf ID to find the task1 ID, then find the execution of sub wf 1 using the task1 ID, then call task-list with the sub wf 1 ID, and so on until they reach task5 and can see that the main wf execution failed because of task5.
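The manual walk described above can be sketched in a few lines of Python. This is only a toy model: the two dictionaries stand in for what the list and get commands would return, and every ID and field name in them is made up for illustration, not taken from a real Mistral response.

```python
# Toy stand-in for the records a user would look up by hand.
# All IDs and field names below are hypothetical.

# workflow execution ID -> task executions in it ("list tasks" step)
TASKS_BY_EXECUTION = {
    "main_wf_ex": [{"id": "task1_ex", "name": "task1", "state": "ERROR"}],
    "sub_wf_1_ex": [{"id": "task2_ex", "name": "task2", "state": "ERROR"}],
    "sub_wf_2_ex": [{"id": "task3_ex", "name": "task3", "state": "ERROR"}],
    "sub_wf_3_ex": [{"id": "task4_ex", "name": "task4", "state": "ERROR"}],
    "sub_wf_4_ex": [{"id": "task5_ex", "name": "task5", "state": "ERROR"}],
}

# task execution ID -> workflow execution it started ("find sub wf" step).
# task5 runs an action, so it has no entry here.
SUB_EXECUTION_BY_TASK = {
    "task1_ex": "sub_wf_1_ex",
    "task2_ex": "sub_wf_2_ex",
    "task3_ex": "sub_wf_3_ex",
    "task4_ex": "sub_wf_4_ex",
}


def find_failed_leaf_tasks(execution_id, path=()):
    """Follow failed tasks downwards and return the path to each failed leaf."""
    found = []
    for task in TASKS_BY_EXECUTION[execution_id]:
        if task["state"] != "ERROR":
            continue
        here = path + (task["name"],)
        sub_execution_id = SUB_EXECUTION_BY_TASK.get(task["id"])
        if sub_execution_id is None:
            # Failed task that started no sub-workflow: end of the chain.
            found.append(here)
        else:
            found.extend(find_failed_leaf_tasks(sub_execution_id, here))
    return found


result = find_failed_leaf_tasks("main_wf_ex")
print(result)
```

Even in this tiny model, reaching task5 takes five task lookups and four sub-execution lookups, which is the tedium the new feature removes.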


Solution 

After a workflow execution fails, you can use an execution of a different workflow to find the root cause. This works 99% of the time.

---
version: "2.0"
wf_to_find_root_cause_of_task_failures:
  input:
    - main_wf_ex_id
    - info: "no info"
  tasks:
    tasks_of_execution_recursive_in_error_flat:
      action: std.noop
      publish:
        info: <% $.info %>
        my_tasks: <% tasks($.main_wf_ex_id, true, ERROR, true) %>


After running this workflow with the relevant workflow execution ID as input, you can use the execution ID of 'wf_to_find_root_cause_of_task_failures' to find the 'tasks_of_execution_recursive_in_error_flat' task and get its published values (mistral task-get-published <TASK_ID>).

How to use the tasks function

  1. Call it from within a Mistral expression (YAQL or Jinja2).
  2. It takes 4 parameters, all optional. Read more in the Mistral DSL documentation.
    1. execution_id
    2. recursive
    3. state
    4. flat
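To get a feel for how the four parameters interact, here is a small Python model of the filtering. It is a sketch of my reading of the behavior, not Mistral's code: the class names, `tasks_model` and `build_chain` are hypothetical, the model takes an execution object instead of an ID, and it returns task names only. The assumed rule for `flat` is that action tasks are always kept, while a task of type workflow is kept only when its state differs from the state of the workflow execution it started.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkflowExecution:
    id: str
    state: str
    tasks: List["TaskExecution"] = field(default_factory=list)


@dataclass
class TaskExecution:
    name: str
    state: str
    # Set only for tasks of type workflow.
    sub_execution: Optional[WorkflowExecution] = None


def tasks_model(execution, recursive=False, state=None, flat=False):
    """Hypothetical model of the filtering; returns matching task names."""
    result = []
    for task in execution.tasks:
        sub = task.sub_execution
        state_ok = state is None or task.state == state
        # Assumed "flat" rule: keep action tasks, and workflow tasks whose
        # state differs from the execution they started.
        flat_ok = not flat or sub is None or sub.state != task.state
        if state_ok and flat_ok:
            result.append(task.name)
        if recursive and sub is not None:
            result.extend(tasks_model(sub, recursive, state, flat))
    return result


def build_chain(task_states, execution_states):
    """Build nested executions, one task each; the last task is an action."""
    sub = None
    for i in range(len(task_states), 0, -1):
        task = TaskExecution(f"task{i}", task_states[i - 1], sub)
        sub = WorkflowExecution(f"wf_ex_{i}", execution_states[i - 1], [task])
    return sub


# Five levels where everything fails, down to the action task.
all_fail = build_chain(["ERROR"] * 5, ["ERROR"] * 5)
print(tasks_model(all_fail, False, "ERROR"))        # ['task1']
print(tasks_model(all_fail, True, "ERROR", True))   # ['task5']

# Five levels where task4 fails although the workflow it started succeeded.
task4_fails = build_chain(["ERROR"] * 4 + ["SUCCESS"],
                          ["ERROR"] * 4 + ["SUCCESS"])
print(tasks_model(task4_fails, True, "ERROR", True))  # ['task4']
```

Without `recursive` only the top-level task is visible, and without `flat` every failed task in the chain is returned, so the combination of both is what narrows the list down to the likely root cause.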

Solution for workflow execution of workflow definition 1

For workflow definition 1 we will get only task5 in the publish of task 'tasks_of_execution_recursive_in_error_flat'.
Illustration image:



Example workflow definition 2

Another workflow definition that fits the structure:

---
version: "2.0"
name: all_fail_wb
workflows:
  main_wf:
    tasks:
      task1:
        workflow: sub_wf_1_of_main_wf
  sub_wf_1_of_main_wf:
    tasks:
      task2:
        workflow: sub_wf_2_of_sub_wf_1
  sub_wf_2_of_sub_wf_1:
    tasks:
      task3:
        workflow: sub_wf_3_of_sub_wf_2
  sub_wf_3_of_sub_wf_2:
    tasks:
      task4:
        workflow: sub_wf_4_of_sub_wf_3
        publish:
          fail_here_please:
  sub_wf_4_of_sub_wf_3:
    tasks:
      task5:
        action: std.noop

Solution for workflow execution of workflow definition 2

For workflow definition 2 we will get only task4 in the publish of task 'tasks_of_execution_recursive_in_error_flat'.

Illustration image:




Example workflow definition 3

Another workflow definition that fits the structure:
---
version: "2.0"
name: TODO_WRITE_IT

Solution for workflow execution of workflow definition 3

For workflow definition 3 we will get only task4 and task2 in the publish of task 'tasks_of_execution_recursive_in_error_flat'.
Illustration image:




mgershen

