Re: [eigen] Optimizing Eigen::Tensor operations for tiny level 1 cache
- To: eigen@xxxxxxxxxxxxxxxxxxx
- Subject: Re: [eigen] Optimizing Eigen::Tensor operations for tiny level 1 cache
- From: Pete Blacker <pete.blacker@xxxxxxxxx>
- Date: Tue, 28 May 2019 20:04:57 +0100
Thanks for the detailed reply, Rasmus. I'll look into these points this week.
The way to optimize the tensor library for hardware with limited cache sizes would be to:
1. Reduce the size of the buffer used for the ".block()" interface. I believe we currently try to fit them in L1, but perhaps the detection doesn't work correctly on your hardware (see the sketch after the numbered points below).
2. Reduce the block sizes used in TensorContraction.
1. By default the block size is chosen such that the blocks fit in L1.
Each evaluator in an expression reports how much scratch memory it needs to compute a block's worth of data through the getResourceRequirements() API.
These values are then merged by the executor.
2. The tensor contraction blocking uses a number of heuristics to choose the block sizes and the level of parallelism. In particular, it tries to pack the lhs into L2 and the rhs into L3.
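One concrete thing to try on the detection front is to query and override the cache sizes Eigen is using. A minimal sketch, assuming the core module's Eigen::l1CacheSize()/setCpuCacheSizes() (which feed Eigen's product blocking heuristics) are also what the tensor contraction path consults on your version, which is worth verifying against your build; the 16KB figures just mirror the hardware described below:

#include <iostream>
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  // What Eigen detected for this CPU (values in bytes).
  std::cout << "L1: " << Eigen::l1CacheSize()
            << "  L2: " << Eigen::l2CacheSize()
            << "  L3: " << Eigen::l3CacheSize() << "\n";

  // Target: 16KB L1 and no L2/L3. There is no "no cache" setting, so
  // clamping L2/L3 down to the L1 size is a guess that keeps every
  // blocking level within the one cache that actually exists.
  Eigen::setCpuCacheSizes(16 * 1024, 16 * 1024, 16 * 1024);

  // A small contraction; the blocking sizes should now be computed
  // from the overridden values rather than the detected ones.
  Eigen::Tensor<float, 2> a(64, 64), b(64, 64);
  a.setRandom();
  b.setRandom();
  Eigen::array<Eigen::IndexPair<int>, 1> dims = {Eigen::IndexPair<int>(1, 0)};
  Eigen::Tensor<float, 2> c = a.contract(b, dims);
  std::cout << "c(0,0) = " << c(0, 0) << "\n";
}

Even if the contraction path on your version ignores the override, printing the detected values will at least tell you whether the detection itself is off.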
I hope these pointers help.
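And if neither knob turns out to be accessible, one workaround is to block an operation by hand with .slice() so that each chunk's working set stays inside L1. A minimal sketch for the elementwise case; the 4-column chunk is a back-of-the-envelope guess for a 16KB L1, not a tuned value:

#include <algorithm>
#include <iostream>
#include <unsupported/Eigen/CXX11/Tensor>

int main() {
  const Eigen::Index rows = 256, cols = 256;
  Eigen::Tensor<float, 2> in(rows, cols), out(rows, cols);
  in.setRandom();

  // Input + output strips of 256 x 4 floats each: 2 * 256 * 4 * 4B = 8KB,
  // leaving half of a 16KB L1 for everything else.
  const Eigen::Index chunk = 4;
  for (Eigen::Index c = 0; c < cols; c += chunk) {
    const Eigen::array<Eigen::Index, 2> off = {0, c};
    const Eigen::array<Eigen::Index, 2> ext = {rows, std::min(chunk, cols - c)};
    // Evaluate the expression one column strip at a time.
    out.slice(off, ext) = in.slice(off, ext).exp() * 0.5f;
  }
  std::cout << "out(0,0) = " << out(0, 0) << "\n";
}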
I'm currently using the Eigen::Tensor module on a relatively small processor which has very limited cache: 16KB of level 1 and no level 2 at all! I've been looking for any way to optimise the blocking of the operations performed by Eigen for a particular cache size, but I can't find anything so far.
Is there a way to optimise the Tensor operations for this type of small cache?