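This directory contains examples for four different methods of profiling Julia code
with TAU: manual instrumentation, instrumentation macros, sampling, and IR rewriting.
Further examples apply these methods to MPI, CUDA, and multi-threaded code.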

Note that Julia support in TAU is experimental and at an early stage of development.
For each method, this file contains instructions on using the method and lists
the current known issues and limitations of the method.

To use TAU with Julia support:

    - Install Julia 1.12 and ensure that the `julia` interpreter is on your path
      and you are able to run uninstrumented Julia code. For example, you should
      be able to run the `Uninstrumented.jl` file in the `src` subdirectory with:

          julia ./src/Uninstrumented.jl

      and should get output like:

          Running loop of size 1
          Running loop of size 1000
          Running loop of size 10000
          Done

    - Build TAU with Julia, pthread, and ITTNotify support.
      For example:

          ./configure -bfd=download -dwarf=download -pthread -ittnotify -julia
          make install

      If you want to use unwinding support, TAU should be built with the same version
      of libunwind as Julia.

When TAU is built with Julia support, a variant of `tau_exec` named `tau_julia` is installed,
which should be used in place of the Julia interpreter. `tau_julia` adds the TAUProfile.jl
module to Julia's load path (`JULIA_LOAD_PATH`) and enables Julia's use of Intel JIT events
to support sampling.

-----------------------------------------------------

Example 1: Manual Instrumentation

TAUProfile.jl provides functions to start and stop TAU timers from Julia code.
Importing the module initializes TAU. The module exposes `tau_start(s::String)` and
`tau_stop(s::String)`, which start and stop the timer named `s`.
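
As a minimal sketch of what manual instrumentation can look like (this is not the
contents of `ManualInst.jl`): the `busy_work` function below is a hypothetical workload
invented for illustration, and the timer name is borrowed from the sample profile in this
section. Placing `tau_stop` in a `finally` clause is a choice made here so that the timer
is also stopped if the timed region throws. The script assumes it is launched through
`tau_julia`, so that `TAUProfile` can be found.

```julia
using TAUProfile  # importing the module initializes TAU

# Hypothetical workload, used only for illustration.
function busy_work(n)
    total = 0.0
    for i in 1:n
        total += sqrt(i)
    end
    return total
end

# Use the same timer name for the start and stop calls.
tau_start("manual_timing_example")
result = try
    busy_work(1_000_000)
finally
    tau_stop("manual_timing_example")  # runs on normal exit and on error
end
println("result = ", result)
```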

The file `ManualInst.jl` contains an example of using manual instrumentation with the TAU Julia interface.
To use it (assuming that you have built TAU as described above), run the command:

    tau_julia -T serial,julia,ittnotify,pthread -- --project=. ./src/ManualInst.jl

This will generate a profile.0.0.0 file which you can view with `pprof`:

    pprof

    Reading Profile files in profile.*

    NODE 0;CONTEXT 0;THREAD 0:
    ---------------------------------------------------------------------------------------
    %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
                msec   total msec                          usec/call
    ---------------------------------------------------------------------------------------
    100.0        0.286          745           1           1     745936 .TAU application
    100.0          238          745           1           4     745650 taupreload_main
    68.0           506          506           1           0     506986 manual_timing_example
    0.0          0.321        0.321           1           0        321 pthread_barrier_wait
    0.0          0.118        0.118           2           0         59 pthread_create

Note that the timer for `manual_timing_example` is present; this is the timer that was created with
`tau_start()` in the code.

-----------------------------------------------------

Example 2: Instrumentation Macros

It is inconvenient to have to manually instrument the entry and every exit from functions.
Julia itself provides profiling macros (albeit ones that use sampling rather than direct
instrumentation) of the form `@profile <expr>`, which enable profiling while executing the
given expression. TAUProfile.jl provides similar macros:

    @tau <name> <expr>           Start a TAU timer named <name>, evaluate the expression
                                 <expr>, and stop the timer.

    @tau_func <func defn>        Wrap an entire function in a TAU timer. The name of the 
                                 timer will be inferred from the name of the function.
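
A rough sketch of how these macros might be used is shown below. It is not the contents of
`MacroInst.jl`. It assumes that the timer name is written as a string literal (the exact
form that `@tau` accepts for `<name>` is not specified here), and the function bodies
are hypothetical placeholders. Only the timer and function names are taken from the sample
profile in this section. It must be run through `tau_julia`.

```julia
using TAUProfile

# Assumption: the timer name is passed as a string literal.
# Nested use: the inner timers appear as children of "outer_timer".
@tau "outer_timer" begin
    @tau "inner_computation_1" sleep(0.1)
    @tau "inner_computation_2" sleep(0.1)
end

# Long-form function definition; the timer name is inferred from the function name.
@tau_func function vector_operations(v)
    return sum(v .* 2)
end

# Short-form function definition.
@tau_func quick_sum(a, b) = a + b

println(vector_operations([1, 2, 3]))
println(quick_sum(1, 2))
```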

The file `MacroInst.jl` contains examples of manual instrumentation using macros.
It demonstrates single-level and nested expression profiling and function profiling
using both the `function foo() ... end` syntax and the `foo() = ...` syntax.

To use it, run the command:

    tau_julia -T serial,julia,ittnotify,pthread -- --project=. ./src/MacroInst.jl

This will generate a profile.0.0.0 file which you can view with `pprof`:

    pprof

    NODE 0;CONTEXT 0;THREAD 0:
    ---------------------------------------------------------------------------------------
    %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
                msec   total msec                          usec/call
    ---------------------------------------------------------------------------------------
    100.0        0.293        1,120           1           1    1120326 .TAU application
    100.0          410        1,120           1           7    1120033 taupreload_main
    36.1           201          404           1           2     404806 outer_timer
    27.1           304          304           1           0     304140 macro_timer
    9.1            101          101           1           0     101874 inner_computation_2
    9.1            101          101           1           0     101440 inner_computation_1
    0.0          0.413        0.413           1           0        413 pthread_barrier_wait
    0.0          0.035        0.394           1           3        394 tau_func_example
    0.0          0.355        0.355           1           0        355 matrix_multiply
    0.0          0.107        0.107           2           0         54 pthread_create
    0.0          0.073        0.073         177         176          0 fibonnaci
    0.0          0.003        0.003           1           0          3 vector_operations
    0.0          0.001        0.001           1           0          1 quick_sum

-----------------------------------------------------

Example 3: Sampling of Uninstrumented Code

Sampling can be used to profile unmodified code. Because Julia uses just-in-time
compilation, compiling functions at runtime upon invocation once the argument
types are known, debug information for symbol resolution is not available through
the standard debug symbol tables. The Julia runtime can be configured to
provide information on address ranges as functions are compiled so that
debuggers and profilers can resolve addresses to function names. 
To do this, the JIT Events API within the Intel ITTNotify interface is used.

Julia must be built with the compile-time option USE_INTEL_JITEVENTS=1.
This is the default for the standard distribution of Julia on x86_64 Linux, but
is *not* the default on other platforms. 

If you are running on any platform other than x86_64 Linux, you will have to build
Julia from source and explicitly enable USE_INTEL_JITEVENTS=1.
Build instructions are located at https://docs.julialang.org/en/v1/devdocs/build/build/
User-configurable options like USE_INTEL_JITEVENTS=1 are specified in the
`Make.user` file.
 
To use sampling, specify both `-ebs` and `-ittnotify` as arguments to `tau_julia`.
The argument `-ebs` enables event-based sampling, while `-ittnotify` enables TAU's
Intel ITTNotify collector, which is used to map addresses to names.

The example `Uninstrumented.jl` contains code without any TAU instrumentation.
It can be profiled with sampling by running:

    tau_julia -T serial,julia,ittnotify,pthread -ebs -ittnotify -- --project=. ./src/Uninstrumented.jl

Run `pprof -a` to view the profile. 
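
For reference, a script needs no TAU-specific code to be sampled. The following is a
hypothetical stand-in, not the contents of `Uninstrumented.jl`: the function name
`run_loop` and the loop body are invented for illustration, and only the printed messages
follow the sample output shown at the top of this file. Note that it neither imports
`TAUProfile` nor calls any TAU function.

```julia
# Hypothetical uninstrumented workload: no TAU imports, timers, or macros.
function run_loop(n)
    println("Running loop of size ", n)
    total = 0.0
    for i in 1:n
        total += sin(i)
    end
    return total
end

for n in (1, 1000, 10000)
    run_loop(n)
end
println("Done")
```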

-----------------------------------------------------

Example 4: IR Rewriting

For Python, TAU features a `tau_python` tool which allows for automatic
profiling of all functions invoked by the Python interpreter using a profiling
interface provided by Python. Julia does not have such an interface, but
similar results can be achieved by recursively rewriting the intermediate
representations of Julia functions. While this still requires modification to the code,
so long as there is a single entry-point to the code, only a single modification is
needed. 

To use IR rewriting, annotate the entry point of the Julia code with
the `@tau_rewrite` macro. For example, if the script calls a function

    main()

that call should be changed to

    @tau_rewrite main()

Note that this recurses into every function call made by the function or any child.
This introduces overhead during startup, as the modified versions of every called function
must be compiled.

The file `RewriteInst.jl` contains an example of IR rewriting.
Rewriting is enabled by adding

    using TAUProfile

at the beginning and marking the entrypoint with @tau_rewrite:

    @tau_rewrite rewrite_example()

The example can be run with:

    tau_julia -T serial,julia,ittnotify,pthread -- --project=. ./src/RewriteInst.jl

and the profile viewed with `pprof`:

    pprof

    NODE 0;CONTEXT 0;THREAD 0:
    ---------------------------------------------------------------------------------------
    %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
                msec   total msec                          usec/call
    ---------------------------------------------------------------------------------------
    100.0         0.307       13,463           1           1   13463316 .TAU application
    100.0        13,430       13,463           1           4   13463009 taupreload_main
    0.2              18           31           1           4      31880 rewrite_example
    0.1           0.004            8           1           1       8737 my_sum
    0.1           0.012            8           1           3       8733 sum
    0.1           0.011            8           1           4       8084 #sum#735
    0.1            0.01            7           2           6       3884 _sum
    0.1           0.006            7           1           4       7156 #_sum#737
    0.0           0.008            6           1           4       6251 #_sum#738
    0.0           0.007            5           1           2       5950 mapreduce
    0.0           0.004            5           1           1       5941 #mapreduce#728
    0.0           0.006            5           1           2       5937 _mapreduce_dim
    0.0           0.013            5           1           8       5925 _mapreduce
    0.0           0.292            5           2         608       2926 mapreduce_impl
    0.0           0.186            4          98         294         46 simd_index
    0.0           0.006            3           1           1       3831 collect
    0.0           0.298            3           1           1       3825 Array
    0.0               1            3           2         305       1764 Vector{Int64}
    0.0             0.2            2          98         294         30 firstindex
    0.0           0.085            2          98          98         22 eachindex
    0.0           0.108            2         100         200         21 axes1
    0.0            0.13            1         101         202         19 axes

    [...]

This truncated example output shows that the @tau_rewrite macro instruments not only
the rewrite_example function, but also all functions within the script itself
as well as all library functions called.


-----------------------------------------------------

Example 5: MPI.jl

TAU can be used to profile multi-rank applications which use MPI.jl.

To do this:

    - Ensure that MPI.jl is installed and is configured to run multi-rank
      applications in your environment. Because TAU must be built with the same
      MPI as MPI.jl, it is recommended to configure MPI.jl to use a system-provided MPI
      following the instructions at https://juliaparallel.org/MPI.jl/stable/configuration/#using_system_mpi

    - Build TAU with MPI support. 
      
      First, ensure that the same MPI that MPI.jl is configured to use is loaded into your environment
      and then build TAU with:

          ./configure -bfd=download -dwarf=download -pthread -ittnotify -julia -mpi
          make install

The file MPI_Uninstrumented.jl contains a simple MPI.jl example.

The example can be run with:

    tau_julia -T serial,julia,ittnotify,pthread -- --project=. ./src/MPI_Uninstrumented.jl

This will collect profiles showing MPI calls captured through TAU's MPI wrapper.

The file MPI_RewriteInst.jl contains the same example, but wrapped in a function which is instrumented
with the IR rewriter. This example can be run similarly:

    tau_julia -T serial,julia,ittnotify,pthread -- --project=. ./src/MPI_RewriteInst.jl

The profile can then be viewed with `pprof`:

    pprof


    NODE 0;CONTEXT 0;THREAD 0:
    ---------------------------------------------------------------------------------------
    %Time    Exclusive    Inclusive       #Call      #Subrs  Inclusive Name
                msec   total msec                          usec/call
    ---------------------------------------------------------------------------------------
    100.0         0.12       17,391           1           1   17391022 .TAU application
    100.0       17,345       17,390           1          40   17390902 taupreload_main
    0.1            3           23           1          21      23035 main
    0.1           20           20           1           2      20130 MPI_Init_thread()
    0.1            4           10           4          12       2736 kwcall
    0.0         0.02            5           2          16       2944 diff_names
    0.0            3            5          33          12        176 getindex
    0.0        0.961            2           2           2       1302 Array{Float64}
    0.0            1            2           1           2       2023 MPI_Finalize()
    0.0        0.885            1           4           4        426 Colon()

    [...]

This truncated output shows that MPI calls (e.g., MPI_Init_thread) and Julia calls are
included in the profile.


-----------------------------------------------------

Example 6: CUDA Uninstrumented

TAU can be used to profile Julia applications which use CUDA.jl.

To do this:

    - Ensure that CUDA.jl is installed and functional in your environment.
      Because TAU must be built with the same version of CUDA as CUDA.jl,
      it is recommended to configure CUDA.jl to use the system CUDA
      rather than a downloaded binary.
      Follow the instructions at https://cuda.juliagpu.org/stable/installation/overview/#Using-a-local-CUDA
      to configure CUDA.jl to use the local CUDA installation.

    - Build TAU with CUDA support:

          ./configure -bfd=download -dwarf=download -pthread -ittnotify -julia -cuda=/path/to/cuda
          make install

The file CUDA_Uninstrumented.jl contains a simple CUDA.jl example.

The example can be run with:

    tau_julia -T serial,julia,ittnotify,pthread,cupti -cupti -ittnotify -- --project=. ./src/CUDA_Uninstrumented.jl

This will collect profiles showing CUDA kernels collected through TAU's CUPTI support.

-----------------------------------------------------

Example 7: CUDA IR Rewriting

The file CUDA_RewriteInst.jl demonstrates using the `@tau_rewrite` macro with
CUDA.jl code. It creates a GPU array, performs an element-wise `sin` operation
on the GPU, and transfers the result back to the CPU.

Because the IR rewriter would otherwise recurse into GPU compiler internals,
the `tau_rewrite_exclude_module` function is used to exclude the modules
`:Base`, `:GPUArrays`, `:GPUCompiler`, and `:Cthulhu` from rewriting:

    tau_rewrite_exclude_module(:Base, :GPUArrays, :GPUCompiler, :Cthulhu)
    @tau_rewrite main()

To use this example, TAU must be built with CUDA support:

    ./configure -bfd=download -dwarf=download -pthread -ittnotify -julia -cuda=/path/to/cuda
    make install

The example can be run with:

    tau_julia -T serial,julia,ittnotify,pthread,cupti -cupti -ittnotify -- --project=. ./src/CUDA_RewriteInst.jl

-----------------------------------------------------

Example 8: Selective Rewriting

The file RewriteSelective.jl demonstrates using `tau_rewrite_exclude_module` to
prevent the IR rewriter from recursing into functions defined in the `Base`
module. This is useful when you want to profile only your own code and avoid
the overhead and noise of instrumenting all of Julia's standard library.

    tau_rewrite_exclude_module(:Base)
    @tau_rewrite rewrite_example()

The example contains user-defined functions (`fibonacci`, `recursive_example`,
`my_sum`, `rewrite_example`) which will be instrumented, while calls into
`Base` functions like `sum` and `collect` will not be rewritten.

The example can be run with:

    tau_julia -T serial,julia,ittnotify,pthread -- --project=. ./src/RewriteSelective.jl
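
A script following this pattern might be laid out as in the sketch below. This is not the
contents of `RewriteSelective.jl`: the function bodies are hypothetical, and only the
function names and the two TAU calls come from the description above. It assumes the
exclusion is registered before the `@tau_rewrite` call, as in the snippet above, and that
the script is run through `tau_julia`.

```julia
using TAUProfile

# Hypothetical bodies; only the names are taken from the description above.
fibonacci(n) = n < 2 ? n : fibonacci(n - 1) + fibonacci(n - 2)

# `sum` is defined in Base, so it is expected to be left unrewritten.
my_sum(v) = sum(v)

function rewrite_example()
    println("fibonacci(10) = ", fibonacci(10))
    println("my_sum = ", my_sum(collect(1:100)))
end

# Exclude Base first, then rewrite starting from the entry point.
tau_rewrite_exclude_module(:Base)
@tau_rewrite rewrite_example()
```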

-----------------------------------------------------

Example 9: Array Operations with IR Rewriting

The file ArrayOps_RewriteInst.jl demonstrates using `@tau_rewrite` to profile
code that performs array operations including random array generation,
`maximum`, `mapslices`, slicing, sorting, and broadcasting. This example
exercises Julia's array library more heavily than the basic RewriteInst.jl
example.

The recursion depth limit for rewriting can be configured using
`tau_rewrite_set_recursion_limit`. A global limit and per-module limits
can be set independently:

    tau_rewrite_set_recursion_limit(4)
    tau_rewrite_set_recursion_limit(Base, 2)
    @tau_rewrite main()

The example can be run with:

    tau_julia -T serial,julia,ittnotify,pthread -- --project=. ./src/ArrayOps_RewriteInst.jl

-----------------------------------------------------

Example 10: Multi-threaded IR Rewriting

The file Threads_RewriteInst.jl demonstrates using `@tau_rewrite` to profile
multi-threaded Julia code that uses `Base.Threads.@spawn` to distribute work
across threads. The example spawns 100 worker tasks, each performing a
computation, distributed across all available Julia threads.

Because threads may execute instrumented functions that have not yet been
seen on that thread, the `tau_rewrite_deferred_contexts` function is called
to enable deferred context creation for TAU timers:

    tau_rewrite_exclude_module(Base)
    tau_rewrite_deferred_contexts(true)
    @tau_rewrite main()

To run with multiple threads, set the `JULIA_NUM_THREADS` environment
variable or pass the `-t` flag to Julia:

    tau_julia -T serial,julia,ittnotify,pthread -- -t 4 --project=. ./src/Threads_RewriteInst.jl
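
The pieces described above could fit together roughly as follows. This is a hedged sketch,
not the contents of `Threads_RewriteInst.jl`: the `work` function and its body are
hypothetical, and only the task count, the use of `@spawn`, and the three TAU calls are
taken from the description above. It must be run through `tau_julia`.

```julia
using TAUProfile
using Base.Threads: @spawn, nthreads

# Hypothetical per-task computation.
work(i) = sum(sqrt(k) for k in 1:(1000 * i))

function main()
    # Spawn 100 tasks; Julia's scheduler distributes them over the available threads.
    tasks = [@spawn work(i) for i in 1:100]
    total = sum(fetch.(tasks))  # wait for every task and combine the results
    println("threads = ", nthreads(), ", total = ", total)
    return total
end

tau_rewrite_exclude_module(Base)
tau_rewrite_deferred_contexts(true)
@tau_rewrite main()
```

With `-t 4` as in the command above, `nthreads()` should report 4; without it, Julia
starts with a single thread by default and all tasks run on that thread.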

