Compilation problem with CUDA, CMake, VS2012 on Windows


This topic contains 17 replies, has 4 voices, and was last updated by  Dimitar 3 years, 8 months ago.

  • #639

    gd
    Participant

    Hi,
    I’m trying to compile Paralution 0.6 on windows (CUDA=yes, OpenCL=no, MKL=no, MIC=no). But after the VS projects and solution are generated I can’t find any VS project files containing CUDA sources (those *.cu located in the src/base/gpu). How are they supposed to be compiled?

    Also, when compiling on Windows, host_affinity.cpp fails, since it uses an #include that is not available on Windows (or only via something like https://code.google.com/p/libpthread/).

    #640

    Dimitar
    Member

    Hi,

    Version 0.6.0 is the first one that supports Windows. For the moment it provides only OpenMP support (no CUDA/OpenCL/MIC).

    The functions in host_affinity.cpp should be ignored on Windows – there is an #ifdef for disabling them.
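    As a rough illustration of the guard described here – a hedged sketch only; the actual macro and function names in host_affinity.cpp may differ:

    ```cpp
    #ifndef _WIN32
    #include <pthread.h>   // POSIX-only header; unavailable with MSVC
    #endif
    #include <cassert>
    #include <cstring>

    // Hypothetical affinity hook: the POSIX branch is compiled out on
    // Windows, as the #ifdef guard in host_affinity.cpp is described to do.
    const char *set_thread_affinity() {
    #ifndef _WIN32
      // pthread/sched-based core pinning would go here (POSIX builds only)
      return "posix affinity path";
    #else
      return "no-op on Windows";
    #endif
    }

    int main() {
      // On a POSIX build this takes the pthread branch; under MSVC the
      // Windows branch compiles instead and pthread.h is never included.
      assert(std::strcmp(set_thread_affinity(), "") != 0);
      return 0;
    }
    ```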

    CUDA support for VS will probably come in the next release (0.7.0), which I hope will be in April. However, we have a couple of users who use PARALUTION with VS and CUDA – if you want to try it yourself, you need to compile the library with SUPPORT_CUDA defined and use the nvcc compiler (for the .cu files). You can also follow the cmake/Makefile for that.

    // Dimitar

    #641

    gd
    Participant

    Thanks, Dimitar! I’ve figured out how to compile the CG example with CUDA as you suggested, by defining SUPPORT_CUDA. I then created a new CUDA 5.5 project (backend_gpu, static lib) in VS2012 and included all files from src/base/gpu/. Then I created a new CUDA exe project based on cg.cpp, with the preconditioner turned off.

    The resulting performance is quite bad on my entry-level NVIDIA K1000M.

    I’m not sure if this is expected; according to the specifications, even this GPU should be faster … but of course FLOPS is not the only difference … :

    CPU i7-3630QM: 49 GFLOPS – 3.9 sec (1.1 sec on 4 cores)
    GPU NVIDIA K1000M: 326.4 GFLOPS – 11.48 sec!

    ———-
    >cg_VS12.exe bcsstk12.mtx

    PARALUTION ver 0.6.0
    PARALUTION platform is initialized
    Accelerator backend: None
    OpenMP threads:1
    ReadFileMTX: filename=bcsstk12.mtx; reading…
    ReadFileMTX: filename=bcsstk12.mtx; done
    LocalMatrix name=bcsstk12.mtx; rows=1473; cols=1473; nnz=34241; prec=64bit; format=CSR; host backend={CPU(OpenMP)}; accelerator backend={None}; current=CPU(OpenMP)
    CG (non-precond) linear solver starts
    IterationControl criteria: abs tol=1e-015; rel tol=1e-006; div tol=1e+008; max iter=1000000
    IterationControl initial residual = 38.3797
    IterationControl RELATIVE criteria has been reached: res norm=3.81716e-005; rel val=9.94579e-007; iter=24927
    CG (non-precond) ends
    Solver execution:2.97574 sec

    =======================================

    >cg_gpu.exe bcsstk12.mtx

    Number of GPU devices in the system: 1
    PARALUTION ver 0.6.0
    PARALUTION platform is initialized
    Accelerator backend: GPU(CUDA)
    OpenMP threads:1
    Selected GPU devices: 0
    ————————————————
    Device number: 0
    Device name: Quadro K1000M
    totalGlobalMem: 2048 MByte
    clockRate: 850500
    compute capability: 3.0
    ECCEnabled: 0
    ————————————————
    ReadFileMTX: filename=bcsstk12.mtx; reading…
    ReadFileMTX: filename=bcsstk12.mtx; done
    LocalMatrix name=bcsstk12.mtx; rows=1473; cols=1473; nnz=34241; prec=32bit; format=CSR; host backend={CPU(OpenMP)}; accelerator backend={GPU(CUDA)}; current=GPU(CUDA)
    CG (non-precond) linear solver starts
    IterationControl criteria: abs tol=1e-015; rel tol=1e-006; div tol=1e+008; max iter=1000000
    IterationControl initial residual = 38.3797
    IterationControl RELATIVE criteria has been reached: res norm=3.43041e-005; rel val=8.9381e-007; iter=32379
    CG (non-precond) ends
    Solver execution:11.4747 sec

    #642

    Dimitar
    Member

    Great! So now you have the CUDA backend with VS!

    Your GPU should be ok. The bad performance is attributed to the fact that the input matrix is quite small. In this case the CPU can cache most of the memory access. You need to try larger problem – try https://www.paralution.com/downloads/Laplace2D4M.mtx.gz this is a 4M Laplace problem. You can also try larger matrices from The University of Florida Sparse Matrix Collection http://www.cise.ufl.edu/research/sparse/matrices/

    In general you can expect some speed-up when you solve problems larger than 1M unknowns.
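    The cache argument can be made concrete with the bcsstk12 numbers from the log above. A self-contained back-of-envelope sketch – the 12-bytes-per-nonzero CSR model is a standard estimate, not a PARALUTION measurement:

    ```cpp
    // Back-of-envelope CSR SpMV arithmetic using the bcsstk12 numbers from
    // the log above (rows = 1473, nnz = 34241, 64-bit values, 32-bit indices).
    #include <cassert>
    #include <iostream>

    int main() {
      const long rows = 1473, nnz = 34241;

      // Per nonzero: one 8-byte value + one 4-byte column index, plus the
      // row-offset array and the x and y vectors.
      const long bytes = nnz * (8 + 4) + (rows + 1) * 4 + 2 * rows * 8;
      const long flops = 2 * nnz;  // one multiply + one add per nonzero

      std::cout << "working set: " << bytes / 1024.0 << " KiB\n";
      std::cout << "arithmetic intensity: " << double(flops) / bytes
                << " flops/byte\n";

      // ~430 KiB: the whole problem sits in the CPU's cache, so the CPU
      // never waits for DRAM, while the GPU still pays kernel-launch and
      // transfer overheads on each of the ~25k CG iterations.
      assert(bytes == 440356 && bytes < 6 * 1024 * 1024);
      return 0;
    }
    ```

    SpMV is memory-bound either way (well under 1 flop per byte), which is why raw GFLOPS ratings say little about sparse solver speed.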

    Best,
    Dimitar

    #644

    gd
    Participant

    Thanks for the links. I’ve tried a somewhat more realistic example: http://www.cise.ufl.edu/research/sparse/matrices/ACUSIM/Pres_Poisson, 715804 nonzeros. The results (double, MultiColoredILU) are:

    CPU i7-3630QM, 1 core: 1.6 sec
    CPU i7-3630QM, 4 core: 0.76 sec
    CPU i7-4900MQ, 1 core: 1.3 sec
    CPU i7-4900MQ, 4 core: 0.5 sec
    GPU NVIDIA Quadro K1000M: 4.3 sec
    GPU NVIDIA GeForce GTX780M: 2.73 sec

    Is there any way to see where the time is spent and maybe find out what I’m doing wrong, or is using a profiler the only option?
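    Short of attaching a real profiler, coarse per-phase timing already answers part of this question. A generic std::chrono sketch – the lambdas below are placeholders for the solver calls, not PARALUTION's API:

    ```cpp
    // Generic phase-timing sketch with std::chrono; the empty lambdas stand
    // in for calls like ls.Build() and ls.Solve(rhs, &x).
    #include <chrono>
    #include <iostream>

    template <class F>
    double time_phase(const char *name, F &&phase) {
      auto t0 = std::chrono::steady_clock::now();
      phase();  // run the phase being measured
      std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
      std::cout << name << ": " << dt.count() << " s\n";
      return dt.count();
    }

    int main() {
      // Separating build and solve shows whether time goes into the
      // preconditioner setup, data transfer, or the iterations themselves.
      double build = time_phase("build", [] { /* ls.Build(); */ });
      double solve = time_phase("solve", [] { /* ls.Solve(rhs, &x); */ });
      return (build >= 0 && solve >= 0) ? 0 : 1;
    }
    ```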

    #645

    Dimitar
    Member

    This matrix is still too small, only 14k unknowns (with quite a lot of nnz per row).

    Try https://www.cise.ufl.edu/research/sparse/matrices/AMD/G3_circuit.html
    or http://www.cise.ufl.edu/research/sparse/matrices/McRae/ecology2.html

    These two should work ok. Let me know if this is not the case.

    #694

    Wonder
    Participant

    First of all – great thanks to the author! Great work.

    I primarily use 3 solvers for my atmosphere modelling application: BiCGStab + ILU(1) for pressure, BiCGStab + MultiColoredGS for velocities, and GMRES + MultiColoredILU(0,1) for temperature.

    I have a notebook with a Core i3-3120M (2.5 GHz, 2 cores, 4 HT) + NVIDIA GeForce 730M (Kepler, compute 3.0) + Intel HD Graphics 4000 (just for fun).
    After a couple of days ;) I managed to compile Paralution 0.6.1 on my Windows 8.1 in these combinations.

    32-bit version
    1. MSVS 2013 + Intel Composer XE 2013 SP1 – works fine with both MKL and OpenMP. No accelerator backend. The best results were with OpenMP off and SUPPORT_MKL enabled. In the logs Paralution shows OpenMP enabled but didn’t use it. That’s ok.
    The OpenMP version, with or without MKL, works too, but makes no sense due to the bad BiCGStab + ILU(1) scaling (see below for details).
    I used the Intel C++ compiler.

    2. MSVS 2012 + CUDA 6.0 beta – rather tricky, as Paralution only supports CUDA 5.5, but after some editing of project options I made it work. Compiled only with the Microsoft C++ compiler, as the Intel one causes link errors.
    With my weak mobile GeForce 730M, the results are slightly better than on the CPU.
    BiCGStab + MultiColoredGS and GMRES + MultiColoredILU(0,1) scaled WELL, but BiCGStab + ILU(1) did not. Preparing the ILU preconditioner takes a lot of RAM and uses only 1 CPU core, not the GPU. But that’s ok – I only need to wait once, because my pressure matrix stays the same during the simulation.
    The solve takes all the power of the multicore CPU or GPU, but the time is similar to the single-threaded CPU setup. That’s quite disappointing, as it is the most time-consuming task and scales badly.

    3. MSVS + Intel OpenCL SDK 3.0 – tricky to compile without the MAKE utility, but I did it by manually compiling and running ocl_check_hw.cpp (to produce HardwareParameters.hpp) and ocl_write_kernels.cpp.
    Note that in the file paths.h.in

    #define kernelPath "@PROJECT_SOURCE_DIR@/src/base/ocl/"
    #define utilsPath "@PROJECT_SOURCE_DIR@/src/utils/"

    I just changed it to #define utilsPath "" and got HardwareParameters.hpp in the current dir, then manually copied it. Setting the path to my actual Paralution installation caused compiler errors – some mess with incorrect characters in such a hardcoded #define style.
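    A likely cause of those "incorrect characters" – my assumption, not confirmed in the thread – is that backslashes in a hardcoded Windows path get parsed as escape sequences inside the string literal. A minimal illustration with made-up paths:

    ```cpp
    #include <cassert>
    #include <cstring>

    // In "C:\temp\base", \t and \b are valid escape sequences, so they are
    // silently replaced by a tab and a backspace; other sequences like \p
    // would not even compile cleanly.
    #define badPath  "C:\temp\base"     // two characters lost to escapes
    #define goodPath "C:\\temp\\base"   // escaped backslashes survive intact
    #define okPath   "C:/temp/base"     // Windows file APIs accept '/' too

    int main() {
      assert(std::strlen(badPath) == 10);   // 12 chars typed, 10 remain
      assert(std::strlen(goodPath) == 12);
      assert(std::strlen(okPath) == 12);
      return 0;
    }
    ```

    Using forward slashes in paths.h.in, as in the CMake-generated defaults above, sidesteps the problem entirely.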

    I then managed to run the tests and my application on my GeForce 730M and Core i3 via OpenCL, but the OpenCL implementation of BiCGStab + ILU(1) cannot do the solving part on the accelerator (CUDA works!) – warning messages about running on the host, only 30% load on the GPU, and bad performance.

    With the settings for Intel HD Graphics, the OpenCL system did not initialize,
    although it works fine on my system with third-party OpenCL programs like the Intel Fluid sample or cgminer ;)

    Hope this helps further development.
    I hope in the next release we will see full CUDA 6.0 and OpenCL support, as well as a fast OpenCL implementation of BiCGStab with the ILU(1) preconditioner. I want to use my program on a powerful AMD GPU system.

    Thanks for great work again,
    Buzalo Grigory aka Wonder

    #695

    Wonder
    Participant

    BTW – my system now has more than 10 million nnz for testing.
    I need much more, but I’m limited by 32 bits.
    Now I’m trying to make the 64-bit build work, but have some problems … I will write about my progress then …

    #696

    ntr
    Participant

    Hey Wonder,
    thank you for your feedback!
    We hope to support CUDA 6.0 with the next release (0.7.0) of PARALUTION.
    Regarding the bad scalability of the ILU preconditioner, please keep in mind that the LUSolve function that is required for solving the preconditioning system is purely sequential. Therefore it is not expected to scale with multicore / GPU systems. If you want to use ILU based preconditioners on a multicore CPU / GPU, please use the MultiColoredILU preconditioner, which is based on a multi-coloring technique (check out our user manual for more details). It also supports OpenCL.
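    The dependency chain ntr describes is visible in plain forward substitution. A self-contained sketch – dense for brevity, whereas ILU of course operates on sparse factors:

    ```cpp
    // Why the triangular solves in ILU preconditioning are sequential:
    // x[i] depends on every x[j] with j < i, so the outer loop carries a
    // strict dependency chain. Multi-coloring (as in MultiColoredILU)
    // reorders the unknowns so that those of one color have no mutual
    // dependencies and can be solved in parallel, color by color.
    #include <cassert>
    #include <vector>

    // Dense lower-triangular forward substitution: solve L x = b.
    std::vector<double> forward_solve(const std::vector<std::vector<double>> &L,
                                      const std::vector<double> &b) {
      const int n = static_cast<int>(b.size());
      std::vector<double> x(n);
      for (int i = 0; i < n; ++i) {    // iterations cannot run concurrently
        double s = b[i];
        for (int j = 0; j < i; ++j)
          s -= L[i][j] * x[j];         // needs the already-computed x[j]
        x[i] = s / L[i][i];
      }
      return x;
    }

    int main() {
      std::vector<std::vector<double>> L = {{2, 0, 0}, {1, 2, 0}, {1, 1, 2}};
      std::vector<double> b = {2, 4, 6};
      auto x = forward_solve(L, b);
      assert(x[0] == 1.0 && x[1] == 1.5 && x[2] == 1.75);
      return 0;
    }
    ```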

    Best regards, ntr

    #699

    Wonder
    Participant

    MultiColoredILU works very badly on my pressure matrix.

    Well … I ran into some trouble before I realised that your matrices and vectors can totally destroy the data passed to them.
    e.g.

    double *rhs = new double[n];
    for (int i = 0; i < n; ++i)
    {
    rhs[i] = 1.0;
    }
    LocalVector<double> *rhsptr = new LocalVector<double>;
    rhsptr->SetDataPtr(&rhs, "RHS", n);
    rhsptr->info();

    delete rhsptr;

    My rhs is also totally destroyed.
    And this also happens with automatic destructors and so on …
    I think it’s not good to destroy memory that you didn’t allocate.

    I understand that SetDataPtr is a low-level utility, but it is the ONLY way to pass data in from a plain double* array.

    #700

    Wonder
    Participant

    One more question, sorry to bother you.
    Why is there no function LocalVector::CopyFrom(const int n, const Value *src)? And a similar CopyTo. And the same for matrices.
    It would be more convenient, and maybe even more efficient in some cases – no need for extra memory allocations and extra local buffers.

    #706

    ntr
    Participant

    Hey Wonder,
    You can get access to your rhs pointer by calling LeaveDataPtr function (this will destroy the rhsptr LocalVector). If you want to keep both, you will have to either allocate two double pointers and copy the data yourself, or clone another LocalVector before you use LeaveDataPtr.

    Best regards,
    ntr

    #707

    Dimitar
    Member

    Hi Wonder,

    Yes, as ntr wrote – regarding your first question – you need to get the data buffer back with LeaveDataPtr(). Then the destructor will not touch your data,
    like:
    rhs = NULL;
    rhsptr->LeaveDataPtr(&rhs);

    and then you can do
    delete rhsptr;

    Regarding your second question: there are such functions for CSR matrices, called CopyFromCSR(const int *row_offsets, const int *col, const ValueType *val) and CopyToCSR(int *row_offsets, int *col, ValueType *val).

    Cheers,
    Dimitar
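    The ownership handshake from the last two replies can be modelled with a toy class that makes the transfer explicit – this mimics the described semantics and is not PARALUTION's implementation:

    ```cpp
    // Toy model of the SetDataPtr/LeaveDataPtr semantics discussed above;
    // PARALUTION's LocalVector behaves analogously, but this is not its code.
    #include <cassert>

    class ToyVector {
      double *data_ = nullptr;
      int size_ = 0;
    public:
      // Takes ownership of the buffer and clears the caller's pointer.
      void SetDataPtr(double **ptr, int n) {
        data_ = *ptr; *ptr = nullptr; size_ = n;
      }
      // Hands ownership back; the destructor will no longer free it.
      void LeaveDataPtr(double **ptr) {
        *ptr = data_; data_ = nullptr; size_ = 0;
      }
      ~ToyVector() { delete[] data_; }  // frees the buffer only if still owned
    };

    int main() {
      const int n = 4;
      double *rhs = new double[n];
      for (int i = 0; i < n; ++i) rhs[i] = 1.0;

      {
        ToyVector v;
        v.SetDataPtr(&rhs, n);  // v now owns the buffer; rhs == nullptr
        v.LeaveDataPtr(&rhs);   // ownership returned before v is destroyed
      }                         // v's destructor does NOT free rhs

      assert(rhs != nullptr && rhs[0] == 1.0);
      delete[] rhs;             // the caller frees it
      return 0;
    }
    ```

    Skipping the LeaveDataPtr call in the inner scope would leave the buffer owned by the vector, so its destructor would free it – exactly the "destroyed rhs" Wonder observed.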

    #709

    Wonder
    Participant

    OK, I see – let’s call it an architecture “feature”. I’ve already made it work by allocating additional buffers and memcpying them.

    I moved to the 64-bit version and can now work with ILU on more than 30 million non-zeros! Quite impressive. And my GeForce 730M is twice as fast as the CPU for the BiCGStab + ILU(1) solve – the preconditioner is built only once and then only the solve runs, as the matrix is constant and only the RHS changes.
    As I stated before, OpenMP performance is bad (yes, I excluded the extra hyper-threading cores). In fact, without OpenMP I got similar timings, so I switched it off.

    Another bunch of questions ;) ;) ;)

    At every time step I need to solve a sparse linear system with 3 rhs vectors.
    What would be the best option?
    Calling Solve 3 times after setting up the solver:

    // all necessary preparations
    ls.Build();
    ls.Solve(rhs1, &x1);
    ls.Solve(rhs2, &x2);
    ls.Solve(rhs3, &x3);

    But in this case the calculations will not run in parallel?
    And what if I have different sparse linear systems that can be solved independently?
    Maybe I could create several independent Paralution solvers and run them in parallel via

    #pragma omp parallel sections
    #pragma omp section
    //call solver1 for system1 on cpu
    #pragma omp section
    //call solver2 for system2 on cpu
    #pragma omp section
    //call solver3 for system3 on gpu

    Will that work? Is Paralution thread-safe when called from a parallel environment?
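    Whether PARALUTION objects may be driven from several threads at once is not answered in this thread. As a generic pattern, though, fully independent solves (no shared state) can run concurrently; this sketch uses std::thread as a portable stand-in for the OpenMP sections above, with a placeholder solve function rather than PARALUTION calls:

    ```cpp
    // Generic concurrent-solve pattern; solve_system is a placeholder that
    // only touches its own data, standing in for an independent solver
    // object per thread. Not a statement about PARALUTION's thread safety.
    #include <cassert>
    #include <thread>
    #include <vector>

    // Placeholder "solve": sums the rhs, standing in for an iterative solve.
    double solve_system(const std::vector<double> &rhs) {
      double s = 0.0;
      for (double v : rhs) s += v;
      return s;
    }

    int main() {
      std::vector<double> rhs1(10, 1.0), rhs2(10, 2.0), rhs3(10, 3.0);
      double x1 = 0, x2 = 0, x3 = 0;

      // Each thread owns its inputs and outputs; nothing is shared.
      std::thread t1([&] { x1 = solve_system(rhs1); });  // e.g. system 1 on CPU
      std::thread t2([&] { x2 = solve_system(rhs2); });  // e.g. system 2 on CPU
      std::thread t3([&] { x3 = solve_system(rhs3); });  // e.g. system 3 on GPU
      t1.join(); t2.join(); t3.join();

      assert(x1 == 10.0 && x2 == 20.0 && x3 == 30.0);
      return 0;
    }
    ```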

    #715

    Wonder
    Participant

    One more problem.
    I have a working config; it works well.
    Then I tried to enable #define PARALUTION_CUDA_PINNED_MEMORY
    in C:\paralution-0.6.1\src\utils\allocate_free.hpp

    It compiled with no problem, BUT at runtime I got this:
    Cuda error: invalid argument
    File: C:/paralution-0.6.1/src/base/gpu/gpu_allocate_free.cu; line: 72

    Maybe it is a CUDA 6.0 incompatibility, or am I doing something wrong? ;)

