So, just to clarify for myself: "parallel" without a loop only has "num_gangs" copies executed, independent of the "num_workers" and "vector_length" sizes, whereas "parallel loop num_gangs(a) num_workers(b) vector_length(c)" would divide the total number of loop iterations into "a*b*c" independent chunks?
Correct. A parallel region without a work-shared loop (i.e. without a loop directive) runs in "gang-redundant mode": each gang executes the same executable statements. So using "num_workers" here isn't invalid, it's just not applicable in this case.
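As a rough illustration of gang-redundant mode, here is a minimal sketch in C. The function name count_region_executions is made up for this example, and the sizes are arbitrary. It counts how many times the body of a loop-less parallel region runs; the num_workers and vector_length clauses are accepted but should not change the count.

```c
/* Hypothetical helper (not from the original post): counts how many
   times the body of a loop-less parallel region is executed. */
int count_region_executions(int requested_gangs)
{
    int count = 0;

    /* No loop directive, so the region runs in gang-redundant mode:
       every gang executes the same statements once. The worker and
       vector sizes are given only to show they do not affect the count. */
    #pragma acc parallel num_gangs(requested_gangs) num_workers(4) vector_length(32) copy(count)
    {
        /* Atomic update so concurrent gangs do not race on count. */
        #pragma acc atomic update
        count += 1;
    }

    return count;
}
```

Built with an OpenACC compiler (for example nvc -acc), count_region_executions(8) should return at most 8, since the implementation is allowed to use fewer gangs than requested. Built without OpenACC support, the pragmas are ignored and the function returns 1.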
If you did have one or more loops with a gang, worker, and/or vector clause, then yes, the loop iterations would be divided across the total number of gangs, workers, and vector lanes. That said, best practice is to not explicitly set the number of gangs, workers, or the vector length except when tuning a specific algorithm, and instead let the compiler and runtime choose these values based on the target device. Different target devices may need different values, so hard-coding these sizes will reduce performance portability.
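To make the contrast concrete, here is a small sketch in C of the same work-shared loop written both ways. The function names scale_auto and scale_tuned and the sizes 64 and 128 are assumptions picked for illustration, not recommended values.

```c
/* Hypothetical example: multiply every element of x by a. */

/* Preferred form: ask for gang and vector parallelism, but let the
   compiler and runtime choose the sizes for the target device. */
void scale_auto(float *x, int n, float a)
{
    #pragma acc parallel loop gang vector copy(x[0:n])
    for (int i = 0; i < n; ++i) {
        x[i] *= a;
    }
}

/* Tuned form: same loop with explicit sizes. The values 64 and 128 are
   arbitrary and may suit one device but not another. */
void scale_tuned(float *x, int n, float a)
{
    #pragma acc parallel loop gang vector num_gangs(64) vector_length(128) copy(x[0:n])
    for (int i = 0; i < n; ++i) {
        x[i] *= a;
    }
}
```

Both versions should produce the same result; only the mapping of iterations onto the device differs. Without OpenACC support the pragmas are ignored and both loops simply run serially.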