Impurity embedding for ML: Part1: Data generation: v01: Step5: Classification

Author: johannes.wasmer@gmail.com.

Version: v01.

Date: see repository.

Description:

The purpose of this notebook is to collect all results of Step4 and to classify and inspect them.

Date of this report:

Some database statistics

TODO Use the aiida-jutools sisclab2020 project notebooks for this instead.

Make sure groups are sums of their subgroups

Classify & group results by finished_ok / failed

Do this each time after having run one of the Step2-Step4 submission loops! Reason: some postprocessing methods rely on the assumption that the subgroups finished_ok / failed are always up to date with their base group. This cannot be done inside the submission loop itself, since the workchains must be finished first.

This can also be used to check progress in the duplicate submission notebook.
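The actual classifier lives in aiida-jutools; as a minimal, self-contained sketch of what this pass does (assuming the base_group/finished_ok, base_group/failed label convention used in the manual count further down, and aiida-core >= 2, where the group collection is Group.collection rather than Group.objects):

from aiida import orm

def classify_by_exit(base_group_label):
    """Sort terminated workchains of a base group into finished_ok / failed subgroups."""
    base = orm.load_group(base_group_label)
    ok, _ = orm.Group.collection.get_or_create(label=f"{base_group_label}/finished_ok")
    failed, _ = orm.Group.collection.get_or_create(label=f"{base_group_label}/failed")
    for node in base.nodes:
        if not isinstance(node, orm.WorkChainNode) or not node.is_terminated:
            continue  # still running or not a workchain: classify on a later pass
        (ok if node.is_finished_ok else failed).add_nodes(node)
    return ok, failed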

Further subdivide the finished_ok groups into converged / not_converged.

This cannot be done in the jutools process classifier, because it is kkr-specific.

Note: This is last-minute thesis submission code. It should be refactored into a module with a load_or_create paradigm.
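A kkr-specific sketch of that subdivision, building on the classification helper above; the output link label workflow_info and the key converged are assumptions and have to be checked against the actual aiida-kkr workchain outputs (they differ between kkr_scf_wc and kkr_imp_wc):

from aiida import orm

def split_by_convergence(ok_group_label, output_link='workflow_info', key='converged'):
    """Split a finished_ok group into converged / not_converged subgroups.
    output_link and key are assumed names; verify them against the workchain outputs."""
    ok = orm.load_group(ok_group_label)
    conv, _ = orm.Group.collection.get_or_create(label=f"{ok_group_label}/converged")
    not_conv, _ = orm.Group.collection.get_or_create(label=f"{ok_group_label}/not_converged")
    for wc in ok.nodes:
        info = wc.outputs[output_link].get_dict()
        (conv if info.get(key, False) else not_conv).add_nodes(wc)
    return conv, not_conv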

To show that processes can be queried by date
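For example, with aiida-core's QueryBuilder directly (the 14-day window and the kkr_imp_wc label are arbitrary choices here):

from datetime import datetime, timedelta, timezone
from aiida import orm

cutoff = datetime.now(timezone.utc) - timedelta(days=14)
qb = orm.QueryBuilder()
qb.append(orm.WorkChainNode,
          filters={'ctime': {'>': cutoff}, 'attributes.process_label': 'kkr_imp_wc'})
qb.count()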

Manually, from the classification above:

num_ok = 190 + 8521   # host_scf/finished_ok + imp/finished_ok (because hih is a subset of the latter)
num_fail = 3 + 1892   # host_scf/failed + imp/failed
num_tot = num_ok + num_fail
failure_rate = num_fail / num_tot
failure_rate, num_tot

(0.1786724495568546, 10606)

So 10606 workchains are grouped out of a total of 31520. That's interesting. What are all these ungrouped workchains? We should find out!

Answer: ah, it's just the kkr_startpot_wc and kkr_imp_sub_wc, so the subworkflows of the grouped kkr_scf_wc and kkr_imp_wc.

To check that this is the case, let's count all kkr_scf_wc and kkr_imp_wc by query. These should roughly match the numbers from the classification, rather than the one from the simple query above.

from aiida.orm import WorkChainNode  # likely already imported earlier in the notebook
import aiida_jutools as jutools      # likely already imported earlier in the notebook
num_kkr_scf_wc = jutools.process.query_processes(node_types=[WorkChainNode], process_label='kkr_scf_wc').count()
num_kkr_imp_wc = jutools.process.query_processes(node_types=[WorkChainNode], process_label='kkr_imp_wc').count()
num_kkr_scf_wc + num_kkr_imp_wc

10611

Yes, so that fits again (the extra 5 workchains here are in the aiida_kkr_tutorial groups).

Double-check ProcessClassifier counts

Plot completion matrix
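A plotting sketch; completed here is a hypothetical dict mapping (impurity symbol, host formula) to a bool for whether the corresponding kkr_imp_wc finished_ok (building it from the groups is not shown):

import numpy as np
import matplotlib.pyplot as plt

imps = sorted({imp for imp, _ in completed})
hosts = sorted({host for _, host in completed})
matrix = np.zeros((len(imps), len(hosts)), dtype=int)
for (imp, host), ok in completed.items():
    matrix[imps.index(imp), hosts.index(host)] = int(ok)

fig, ax = plt.subplots(figsize=(10, 6))
ax.imshow(matrix, aspect='auto', interpolation='nearest')
ax.set_xticks(range(len(hosts)))
ax.set_xticklabels(hosts, rotation=90)
ax.set_yticks(range(len(imps)))
ax.set_yticklabels(imps)
ax.set_xlabel('host')
ax.set_ylabel('impurity')
ax.set_title('completion matrix (1 = finished_ok)')
plt.show()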

Convergence statistics

Runtime statistics

Note that the workchain count here is exactly the same as the finished_ok workchain count from the process classification. That means the total running times could only be extracted from finished_ok workchains.
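A rough way to extract them, using only the node timestamps of the finished_ok workchains (ctime to mtime includes queueing time, so this is an upper bound on the walltime; the group label imp/finished_ok follows the classification above):

import numpy as np
from aiida import orm

group_ok = orm.load_group('imp/finished_ok')
runtimes_h = np.array([(wc.mtime - wc.ctime).total_seconds() / 3600.0 for wc in group_ok.nodes])
print(f"n={runtimes_h.size}, mean={runtimes_h.mean():.2f} h, median={np.median(runtimes_h):.2f} h")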

Step2 HostSCF Submission plan

See Step2 notebook.

Step3 Host GF Write-out Submission plan

Step4 Submission plan

The EmbeddingsEnumerator defined the Step4 submission plan: which imp:host kkr_imp_wc should be submitted.

This number is of course larger than the output of Step4: the actual number of finished_ok kkr_imp_wc workflows.

Let's spell the numbers out a bit more.
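A sketch of that comparison; embeddings_plan is a placeholder for the list of planned imp:host combinations from the EmbeddingsEnumerator (its actual interface is not shown here), and the group label follows the classification above:

from aiida import orm

num_planned = len(embeddings_plan)  # placeholder: planned imp:host combinations
num_finished_ok = orm.load_group('imp/finished_ok').count()
print(f"planned: {num_planned}, finished_ok: {num_finished_ok}, "
      f"yield: {num_finished_ok / num_planned:.1%}")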

The questions to answer now are:

More result inspections

Inspector: Find missing, count duplicates

Runtime statistics

Check recently submitted workchains (WIP)

Here we want to check if the most recently submitted workchains had some common problems.

Now check why the failed ones failed.
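A sketch for both checks: query the kkr_imp_wc created within the last week (an arbitrary window) and tally their process states and exit statuses:

from collections import Counter
from datetime import datetime, timedelta, timezone
from aiida import orm

cutoff = datetime.now(timezone.utc) - timedelta(days=7)
qb = orm.QueryBuilder()
qb.append(orm.WorkChainNode,
          filters={'ctime': {'>': cutoff}, 'attributes.process_label': 'kkr_imp_wc'})
recent = qb.all(flat=True)

Counter((wc.process_state.value if wc.process_state else None, wc.exit_status) for wc in recent)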

Investigate failed

Look at the elements of the failed kkr_imp_wc. Is there a trend?
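A sketch that tallies the impurity elements of the failed workchains; the group label imp/failed follows the classification above, and reading Zimp from an impurity_info input Dict is an assumption about the kkr_imp_wc inputs:

from collections import Counter
from aiida import orm
from ase.data import chemical_symbols  # index = atomic number

group_failed = orm.load_group('imp/failed')
zimps = []
for wc in group_failed.nodes:
    try:
        imp_info = wc.inputs.impurity_info.get_dict()  # assumed input link label
    except AttributeError:
        continue
    zimp = imp_info.get('Zimp')  # assumed key; may be a single number or a list
    zimps.extend(zimp if isinstance(zimp, (list, tuple)) else [zimp])

Counter(chemical_symbols[int(z)] for z in zimps if z is not None).most_common()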


Make sure that no imp workchains were submitted with the wrong code inputs, i.e. kkr=kkrhost and kkrimp=kkrhost (the same kkrhost code passed for both).
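A sketch of that sanity check; the base group label imp and the input link labels kkr / kkrimp are assumptions about this project's group layout and the kkr_imp_wc inputs:

from aiida import orm

group_imp = orm.load_group('imp')
suspicious = []
for wc in group_imp.nodes:
    try:
        kkr_code = wc.inputs.kkr        # kkrhost code (assumed input link label)
        kkrimp_code = wc.inputs.kkrimp  # kkrimp code (assumed input link label)
    except AttributeError:
        continue  # workchain without both code inputs
    if kkrimp_code.uuid == kkr_code.uuid:
        suspicious.append(wc.pk)  # same code node passed for both inputs
print(f"{len(suspicious)} kkr_imp_wc with kkr and kkrimp pointing to the same code")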

Delete excepted/killed workchains (WIP)
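A sketch using aiida-core's delete_nodes; note that deleting by workchain pk also removes descendants under the default traversal rules, so inspect the dry run before switching to dry_run=False:

from aiida import orm
from aiida.tools import delete_nodes

qb = orm.QueryBuilder()
qb.append(orm.WorkChainNode,
          filters={'attributes.process_state': {'in': ['excepted', 'killed']}},
          project='id')
pks = qb.all(flat=True)
print(f"found {len(pks)} excepted/killed workchains")

# Dry run first; only switch to dry_run=False after inspecting the report.
deleted_pks, was_deleted = delete_nodes(pks, dry_run=True)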

Visualize one kkr_imp_wc
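One way to do this from Python, via aiida-core's provenance graph tools (requires the graphviz package; IMP_WC_PK is a placeholder). Alternatively, verdi node graph generate <PK> does the same from the shell.

from aiida import orm
from aiida.tools.visualization import Graph

wc = orm.load_node(IMP_WC_PK)  # placeholder: pk or uuid of one kkr_imp_wc
graph = Graph(graph_attr={'rankdir': 'LR'})
graph.recurse_descendants(wc, include_process_inputs=True, annotate_links='both')
graph.graphviz.render(f'kkr_imp_wc_{wc.pk}', format='pdf', cleanup=True)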