Newest 'sse' Questions - Stack Overflow

1 vote

1 answer

72 views

sse4.2 _mm_cmpistrm/_mm_cmpestrm instruction get wrong result

I want to use the following code to compute the intersection of array a and array b: #include <nmmintrin.h> #include <cstdint> #include <cstdio> void test(uint16_t *a, uint16_t *b) { ...

zelin

21

asked Nov 12 at 9:58

1 vote

0 answers

73 views

Run TensorFlow 2.17 on CPU without AVX

I would like to install TensorFlow on a Windows system with a processor that does not seem to support AVX (Pentium J6426). I saw that a minority of people had this problem in the past with earlier ...

Alkhwarizmi

75

asked Nov 6 at 9:30

3 votes

1 answer

113 views

How might I optimize computing a large bilinear function exhibiting more-or-less random access?

I have a function that takes two arrays and fills a third array with bilinear combinations of the coordinates of the two arrays. The function is a Clifford product over an 8-dimensional real vector ...

MonadMania

119

asked Oct 24 at 0:20

0 votes

1 answer

79 views

Access violation when performing matrix product using SIMD in Rust

I'm making my own linalg library for my OpenGL project, and was thinking of accelerating matmul using SIMD. Minimal reproducible example: use std::arch::x86_64::*; #[derive(Debug, Clone, Copy)] ...

alco

1

asked Oct 20 at 8:13

3 votes

1 answer

174 views

What is the point of MOVAPS in x86 if it does the same as MOVUPS in modern computers?

I was coding a memset function in an embedded system, and I found the fastest way was using movups. Given my memory was already aligned, I decided to use movaps to get faster & smaller results. ...

Egemen Yalın

33

asked Oct 8 at 17:50

0 votes

1 answer

85 views

Structure of SSE vectorization calls for summing vector of floats

This question was prompted by the recent question Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?. Not having delved into SSE intrinsics, the question arose, "How ...

David C. Rankin

84.3k

asked Sep 30 at 8:25

28 votes

2 answers

3k views

Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?

I've recently been diving deeper into x86-64 architecture and exploring the capabilities of SSE and AVX. I attempted to write a simple vector addition function like this: void compute(const float *a, ...

nowox

28.9k

asked Sep 29 at 21:02

2 votes

1 answer

106 views

What does the "i" in COMISS / VUCOMISS stand for?

I'm currently reading CS:APP 3rd edition, and I found instruction names like vucomiss a little verbose (in my view), so I looked for the full name of the instruction to help memorize it. I found the ...

user27532204

21

asked Sep 29 at 10:55

6 votes

1 answer

247 views

Why do modern compilers prefer SSE over FPU for single floating-point operations

I recently read the disassembly of my compiled code and found that a lot of floating-point operations are done using XMM registers and SSE instructions. For example, the following code: float ...

HiroIshida

1,603

asked Sep 11 at 11:52

2 votes

0 answers

95 views

Is there a better way to load and unload data to and from an aligned memory location in C?

Here's a simple working C program that multiplies two (here identical) 16-byte float vectors using SSE and stores the output in s. #include <xmmintrin.h> float *data; ...

Ayush

1,359

asked Sep 10 at 13:23

0 votes

0 answers

105 views

Request exception: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

I am making a Python app to fetch data from my website (which is written in PHP) as a stream. main.py import requests from sseclient import SSEClient def get_sse(url): try: # Make the ...

blazgocompany

11

asked Sep 4 at 23:15

2 votes

1 answer

228 views

Small performance gain using AVX512 over SSE in batch quaternion-vector multiplication

I've implemented a quaternion-vector multiplication function using SIMD instructions, with conditional compilation for AVX512, AVX2, and SSE. While I expected to see significant performance ...

HiroIshida

1,603

asked Sep 4 at 13:50

0 votes

1 answer

86 views

Not getting performance improvement with AVX in comparison with SSE

I am trying to utilise the SIMD capabilities of the processor. However, with vectorisation I observed no improvement when compiling the binary for AVX (cmake flag -mavx2) compared with ...

Sachendra Singh

1

asked Sep 4 at 6:09

0 votes

1 answer

62 views

Why CSAPP say Gcc do not use vcvtss2sd?

Computer Systems: A Programmer's Perspective (3rd), in section 3.11.1, says "Suppose the low-order 4 bytes of %xmm0 hold a single-precision value; then it would seem straightforward to use the ...

TouXianGuan

3

asked Jul 20 at 4:27

3 votes

1 answer

145 views

Twice as slow SIMD performance without extra copy

I've been optimizing some code and stumbled across a peculiar case. Here are the two assembly listings: ; FAST lea rcx,[rsp+50h] call qword ptr [Random_get_float3] ;this function ...

Alex

584

asked Jul 19 at 8:54
