2,377 questions
1 vote, 1 answer, 72 views
sse4.2 _mm_cmpistrm/_mm_cmpestrm instruction get wrong result
I want to use the following code to compute the intersection of array a and array b:
#include <nmmintrin.h>
#include <cstdint>
#include <cstdio>
void test(uint16_t *a, uint16_t *b) {
...
1 vote, 0 answers, 73 views
Run TensorFlow 2.17 on CPU without AVX
I would like to install tensorflow on a Windows system with a processor that does not seem to support AVX (Pentium J6426).
I saw that a minority of people had this problem in the past with earlier ...
3 votes, 1 answer, 113 views
How might I optimize computing a large bilinear function exhibiting more-or-less random access?
I have a function which takes two arrays and fills a third array with bilinear combinations of the coordinates of the two arrays. The function is a Clifford product over an 8-dimensional real vector ...
0 votes, 1 answer, 79 views
Access violation when performing matrix product using SIMD in Rust
I'm making my own linalg library for my OpenGL project, and was thinking of accelerating matmul using SIMD.
Minimal reproducible example:
use std::arch::x86_64::*;
#[derive(Debug, Clone, Copy)]
...
3 votes, 1 answer, 174 views
What is the point of MOVAPS in x86 if it does the same as MOVUPS in modern computers?
I was coding a memset function in an embedded system, and I found the fastest way was using movups. Given my memory was already aligned, I decided to use movaps to get faster & smaller results. ...
0 votes, 1 answer, 85 views
Structure of SSE vectorization calls for summing vector of floats
This question was brought up by the recent question Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?. Not having delved into SSE intrinsics, the question arose, "How ...
28 votes, 2 answers, 3k views
Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?
I've recently been diving deeper into x86-64 architecture and exploring the capabilities of SSE and AVX. I attempted to write a simple vector addition function like this:
void compute(const float *a, ...
2 votes, 1 answer, 106 views
What does the "i" in COMISS / VUCOMISS stand for?
I'm currently reading CS:APP, 3rd edition, and I found instructions like vucomiss a little verbose (in my view), so I looked for the full name of the instruction to help me memorize it.
I found the ...
6 votes, 1 answer, 247 views
Why do modern compilers prefer SSE over FPU for single floating-point operations
I recently tried reading the disassembly of my compiled code and found that a lot of floating-point operations are done using XMM registers and SSE instructions. For example, the following code:
float ...
2 votes, 0 answers, 95 views
Is there a better way to load and unload data to and from an aligned memory location in C?
Here's the C code for a simple working program that multiplies two 16-byte float vectors (here, the same vector) using SSE and stores the output in s.
#include <xmmintrin.h>
float *data;
...
0 votes, 0 answers, 105 views
Request exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
I am making a Python app to fetch data from my website (which is written in PHP) as a stream.
main.py
import requests
from sseclient import SSEClient
def get_sse(url):
    try:
        # Make the ...
2 votes, 1 answer, 228 views
Small performance gain using AVX512 over SSE in batch quaternion-vector multiplication
I've implemented a quaternion-vector multiplication function using SIMD instructions, with conditional compilation for AVX512, AVX2, and SSE. While I expected to see significant performance ...
0 votes, 1 answer, 86 views
Not getting performance improvement with AVX in comparison with SSE
I am trying to utilise the SIMD capabilities of my processor. However, with vectorisation I observed no improvement when compiling the binary for AVX (cmake flag -mavx2) compared with ...
0 votes, 1 answer, 62 views
Why CSAPP say Gcc do not use vcvtss2sd?
Computer Systems: A Programmer's Perspective (3rd edition), in section 3.11.1, says "Suppose the low-order 4 bytes of %xmm0 hold a single-precision value; then it would seem straightforward to use the ...
3 votes, 1 answer, 145 views
Twice as slow SIMD performance without extra copy
I've been optimizing some code, and stumbled across a peculiar case.
Here are the two assembly listings:
; FAST
lea rcx,[rsp+50h]
call qword ptr [Random_get_float3] ;this function ...