1 vote
1 answer
72 views

sse4.2 _mm_cmpistrm/_mm_cmpestrm instruction get wrong result

I want to use the following code to compute the intersection of array a and array b: #include <nmmintrin.h> #include <cstdint> #include <cstdio> void test(uint16_t *a, uint16_t *b) { ...
zelin
  • 21
1 vote
0 answers
73 views

Run TensorFlow 2.17 on CPU without AVX

I would like to install tensorflow on a Windows system with a processor that does not seem to support AVX (Pentium J6426). I saw that a minority of people had this problem in the past with earlier ...
Alkhwarizmi
3 votes
1 answer
113 views

How might I optimize computing a large bilinear function exhibiting more-or-less random access?

I have a function which takes two arrays and fills a third array with bilinear combinations of the coordinates of the two arrays. The function is a clifford product over an 8-dimensional real vector ...
MonadMania
0 votes
1 answer
79 views

Access violation when performing matrix product using SIMD in Rust

I'm making my own linalg library for my opengl project, and was thinking of accelerating matmul using simd. minimal reproducible example: use std::arch::x86_64::*; #[derive(Debug, Clone, Copy)] ...
alco
  • 1
3 votes
1 answer
174 views

What is the point of MOVAPS in x86 if it does the same as MOVUPS in modern computers?

I was coding a memset function in an embedded system, and I found the fastest way was using movups. Given my memory was already aligned, I decided to use movaps to get faster & smaller results. ...
Egemen Yalın
0 votes
1 answer
85 views

Structure of SSE vectorization calls for summing vector of floats

This question was brought up by the recent question Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?. Not having delved into SSE intrinsic, the question arose, "How ...
David C. Rankin
28 votes
2 answers
3k views

Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?

I've recently been diving deeper into x86-64 architecture and exploring the capabilities of SSE and AVX. I attempted to write a simple vector addition function like this: void compute(const float *a, ...
nowox
  • 28.9k
2 votes
1 answer
106 views

What does the "i" in COMISS / VUCOMISS stand for?

Currently I'm reading CS:APP 3rd edition, and I found the instructions a little bit verbose (in my view) like vucomiss, so I looked for the full name of the instruction to help memorize it. I found the ...
user27532204
6 votes
1 answer
247 views

Why do modern compilers prefer SSE over FPU for single floating-point operations

I recently tried reading the assembly of my compiled binary and found that a lot of floating-point operations are done using XMM registers and SSE instructions. For example, the following code: float ...
HiroIshida
  • 1,603
2 votes
0 answers
95 views

Is there a better way to load and unload data to and from an aligned memory location in C?

Here's the code for a simple working C program that multiplies two 16-byte float vectors (here, the same vector) using SSE and stores the output in s. #include <xmmintrin.h> float *data; ...
Ayush
  • 1,359
0 votes
0 answers
105 views

Request exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I am making a Python app to fetch data from my website (which is written in PHP) as a stream. main.py import requests from sseclient import SSEClient def get_sse(url): try: # Make the ...
blazgocompany
2 votes
1 answer
228 views

Small performance gain using AVX512 over SSE in batch quaternion-vector multiplication

I've implemented a quaternion-vector multiplication function using SIMD instructions, with conditional compilation for AVX512, AVX2, and SSE. While I expected to see significant performance ...
HiroIshida
  • 1,603
0 votes
1 answer
86 views

Not getting performance improvement with AVX in comparison with SSE

I am trying to utilise the SIMD capabilities of my processor. However, with vectorisation I observed no improvement when compiling the binary for AVX (cmake flag -mavx2) compared with ...
Sachendra Singh
0 votes
1 answer
62 views

Why CSAPP say Gcc do not use vcvtss2sd?

Computer Systems: A Programmer's Perspective (3rd edition), in section 3.11.1, says "Suppose the low-order 4 bytes of %xmm0 hold a single-precision value; then it would seem straightforward to use the ...
TouXianGuan
3 votes
1 answer
145 views

Twice as slow SIMD performance without extra copy

I've been optimizing some code and stumbled across a peculiar case. Here are the two assembly listings: ; FAST lea rcx,[rsp+50h] call qword ptr [Random_get_float3] ;this function ...
Alex
  • 584
