Solving Ax = b for multiple b vectors

Front page Forums Plug-ins Solving Ax = b for multiple b vectors

This topic contains 8 replies, has 2 voices, and was last updated by  Matthaios 2 years, 10 months ago.

Viewing 9 posts - 1 through 9 (of 9 total)
  • Author
  • #1313


    Dear Paralusion developers and community,

    I would like to solve a linear system of the form Ax = b for multiple b vectors and get multiple x solutions.

    Calling the mex function multiple times in not an efficient solution since the preconditioner and solver are build again. In theory the preconditioner and solver build, and the transfer of A to the accelerator, should happen only once, and the the solve function should be called multiple times for each b vector.

    Things get a bit more complicated because my data comes form Matlab, so I have to use the Matlab wrapper.
    I looked for a similar example in the documentation and in the code examples but I couldn’t not find one, please if there is an existing example (for the Matlab interface or regular C++ code) just point me to it and I will work it from there on.

    I tried to modify the code to do the following :
    1) call the mex once, passing matrix A[NxN] and the vector b [N*Q] (where Q is the number of vectors I want to solve for, calling it “numberSources” in the code)
    2)build the preconditioner & solver for A
    3)chop b in 1xN pieces
    4)solver for each piece
    5)create output vector x[N*Q]

    My code works, but I was expecting a huge speed up for not building the preconditioner&solver multiple times, however I got only a 50% speed increase, for a Q of ~300.
    My coding skills are not the best when it comes to C++ and I am just now becoming familiar with your library, so I am sure there is a more efficient way of doing it.

    I am attaching my code, any feedback and advice is welcome, please don’t skip the basics, I might not know them!
    (Please bare with me, I am just learning! :D )

    Thanks in advance.


    #include <paralution.hpp>
    #include “mex.h”

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[]) {

    // Check output argument
    if (nlhs != 1) {
    mexErrMsgTxt(“Expecting one output argument.”);

    // Check input arguments
    if (nrhs < 2) {
    mexErrMsgTxt(“Expecting at least two input arguments.”);

    // Check if input matrix and rhs are double precision
    if (!mxIsDouble(prhs[0]) || !mxIsSparse(prhs[0])) {
    mexErrMsgTxt(“Expecting sparse matrix in double precision.”);

    if (!mxIsDouble(prhs[1])) {
    mexErrMsgTxt(“Expecting rhs in double precision.”);

    double tol = 1e-6;
    mwSize maxIter = 10000;

    if (nrhs > 2) tol = mxGetScalar(prhs[2]);
    if (nrhs > 3) maxIter = mxGetScalar(prhs[3]);

    // Get matrix arrays
    mwIndex *mw_row = mxGetIr(prhs[0]);
    mwIndex *mw_col_ptr = mxGetJc(prhs[0]);
    double *mw_val = mxGetPr(prhs[0]);

    // Get matrix sizes
    int nrow = int(mxGetN(prhs[0]));
    int ncol = int(mxGetM(prhs[0]));
    int nnz = int(mw_col_ptr[nrow]);

    // Get vector size
    int vcol = int(mxGetM(prhs[1]));

    int *row = NULL;
    int *col_ptr = NULL;
    double *val = NULL;
    paralution::allocate_host(nnz, &val);
    paralution::allocate_host(nnz, &row);
    paralution::allocate_host(nrow+1, &col_ptr);

    // copy needed due to diff types
    for (int i=0; i<nnz; ++i) {
    row[i] = mw_row[i];
    val[i] = mw_val[i];

    for (int i=0; i<nrow+1; ++i)
    col_ptr[i] = mw_col_ptr[i];

    // Get rhs
    double *mw_rhs = mxGetPr(prhs[1]);
    double *rhs = NULL;

    paralution::allocate_host(vcol, &rhs);

    for (int i=0; i<vcol; ++i)
    rhs[i] = mw_rhs[i];

    // Allocate output vector
    plhs[0] = mxCreateDoubleMatrix(vcol, 1, mxREAL);
    double *sol = mxGetPr(plhs[0]);

    // Initialize PARALUTION

    // Create PARALUTION data structures
    paralution::LocalMatrix<double> A;
    paralution::LocalVector<double> b;
    paralution::LocalVector<double> x;

    // Fill PARALUTION data
    // For symmetric matrices CSC == CSR
    A.SetDataPtrCSR(&col_ptr, &row, &val, “A”, nnz, ncol, nrow);

    int numberSources = vcol/nrow;

    ////////////////// Solver
    paralution::BiCGStab<paralution::LocalMatrix<double>, paralution::LocalVector<double>, double> ls;

    ///////////////// Preconditioner
    paralution::IC<paralution::LocalMatrix<double>, paralution::LocalVector<double>, double> p;

    ls.Init(0.0, tol, 1e8, maxIter);

    // Build solver and preconditioner


    // Disable PARALUTION printing

    double* sourcePointer = NULL;
    int xi = 0;
    int bi = 0;
    for (int sources=0; sources < numberSources; sources++)
    // Copy one source vector – new
    sourcePointer = NULL;
    paralution::allocate_host(nrow, &sourcePointer);
    for (int i=0; i<nrow; ++i)
    sourcePointer[i] = mw_rhs[bi];

    x.Allocate(“x”, nrow);

    b.SetDataPtr(&sourcePointer, “b”, ncol);

    // Solve Ax=b
    ls.Solve(b, &x);

    double *ptr_sol = NULL;


    for (int i=0; i<nrow; ++i)
    sol[xi] = ptr_sol[i];







    Hi Matthaios,

    I have checked your code, and it looks quite ok to me. Of course, you could speed up the computation by not copying your rhs and x data, but directly passing the pointers to it. However, this will not work with flexible backends. If you use a GPU, for example, you would have to hack your own kernel for splitting up and handing over the data pointer.
    In any way, I would not free the solution and rhs host vector every iteration. You can simple reuse it. Also, I highly recommend you to not use the IC preconditioner on any parallel architectures since its implementation is sequentially by nature. Please have a look at the MultiColoredILU preconditioner. If your system is spd, you could also try FSAI preconditioning.
    We are actually considering an implementation of a solve function for multiple rhs. However, this is something for the future.
    Please keep us posted on your topic!

    Best regards,



    Hi Nico,

    I am using a GPU as accelerator with OpenCL backend, and indeed the copying must be taking lots of time, but I don’t know how to implement the technique you mention to pass directly the pointers. If there is an example of how to do this, please can you point me to it? I believe this would result in a significant speedup.

    I moved the memory freeing for b after the for loop, it works, but not really makes any computational time difference.
    Would it be possible to not re-allocated and move to accelerator x at every iteration(?), I think this takes a lot of time as well, and maybe could be avoided? Instead just overwrite x and use the same space? When I try to do this the solver fails when solving for the second b vector, I guess either the MoveToHost or the LeaveDataPtr completely removes the space allocated for x, so I have to reallocate x, move to accelerator and se to zeros, in every iteration (well, I might be doing something wrong there…?).

    Regarding the preconditioner, you mean move (solver/preconditioner and A) to accelerator first and then build it? This seems to be working slower for my case (A ~ [50kx50k]), but should be faster for very big dimensions of A because both solver and preconditioner building could benefit more from parallelization. I will try it out with bigger dimensions and let you know.

    Thanks for your response,
    Best wishes,



    Hi Matthaios,

    Regarding the preconditioner, I meant that you should switch from IC to MultiColoredILU, or in case of an spd system to FSAI (if possible).
    Yes, its possible to not re-allocate and move the data back and forth. You will need some basic knowledge on GPU programming to achieve this. Alternatively, you could hack a solve function within PARALUTION, that allows you to pass several right hand sides for solving. As I mentioned earlier, this is on our todo list for a future release.
    However, I highly doubt, that you can get speedups by using GPUs with a sparse 50k x 50k matrix. Did you try the host version?

    Also, please keep in mind, if you leave a data pointer from PARALUTION when the object is on the accelerator, you will not be able to access the data pointer on the host system and vice versa.

    Best regards,

    • This reply was modified 2 years, 10 months ago by  nico.
    • This reply was modified 2 years, 10 months ago by  nico.


    Hi Nico,

    Thanks for your response.

    Just a quick clarification/question.

    What problem is big enough to benefit from parallelism? I will solve problems in the range between 50k – 500k, will I see the difference?
    To take advantage of the parallel logic of a preconditioner do I have to move to accelerator first and then build?
    Or is building parallel in the host anyway ,using the CPU?

    I will try to do what you mentioned with the solver and probably bother you again with more questions when I have a better understanding of GPU programming.

    Again, thanks for all your feedback.

    Best wishes,



    Hey Matthaios,

    500k still seems to be quite low for a GPU, then again, this also depends on the underlying host CPU and host memory bandwidth, as well as the utilized GPU. In general, I would not expect too much.

    Basically, all your code is performed on the host system up to the point where you call the MoveToAccelerator() functions. If you build the system before, it will be built on the host, if you build it after the call, it will be built on the GPU (if GPU implementation is available, else it automatically falls back to the host, with a short warning).

    MultiColoredILU building process for example, is supported on the GPU when using CUDA. You can find details on the implemented backends in our user manual.

    Best regards,



    Hi Nico,

    About the A matrix size I mean 500kx500k, so that should be enough?

    Thanks for your answers, you have been very helpfull.

    Best wishes,



    Hi Matthaios,

    As mentioned before, 500k x 500k might be low for a GPU. However, this depends also a lot on the sparsity pattern and the GPU / CPU itself, how CPU cache is being used and connected, etc. I cannot give you a general rule, you will have to do some tests with your hardware and matrices to find the best possible solution.

    Best regards,

    • This reply was modified 2 years, 10 months ago by  nico.


    Hey Nico,

    Reducing sparsity bandwidth is great idea – definitely gonna try it out!

    Thanks for all your help!


Viewing 9 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic.