This really depends on how the software engineers have decided to implement sound in their engine. However, it normally takes the form of caching the sounds in RAM and then passing them to the sound card to be played at the appropriate moment.
The human ear cannot distinguish sounds less than about 65 ms apart. It follows that as long as the gun sound happens less than this time after you press your fire button, you will think that the two actions were simultaneous.
A 0.5 second sound at 44.1 kHz and 16-bit (mono) is 44,100 bytes long. Most system buses are 32 bits wide, i.e. they send 32 bits in parallel. Consider a system bus running at 100 MHz. This bus can deliver 32 bits every 0.00000001 seconds. The sound needs 11,025 such transfers, so it would take 0.00011025 s to transmit this data to the sound card. This is roughly 600 times faster than needed for a human to consider the press of the button and the sound simultaneous.
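To make the arithmetic easy to check, here is a small sketch of the same calculation. It assumes a mono sound and an idealised bus that moves one full 32-bit word on every clock cycle with no overhead. The function names are made up for illustration.

```python
# Back-of-the-envelope check of the figures above.
# Assumptions: mono audio, and an ideal bus that transfers one
# full word per clock cycle with no protocol overhead.

SAMPLE_RATE_HZ = 44_100          # 44.1 kHz
BYTES_PER_SAMPLE = 2             # 16-bit samples
DURATION_S = 0.5                 # length of the gunshot sound
BUS_WIDTH_BYTES = 4              # 32-bit bus
BUS_CLOCK_HZ = 100_000_000       # 100 MHz
PERCEPTION_THRESHOLD_S = 0.065   # the ~65 ms figure quoted above


def sound_size_bytes(duration_s, sample_rate_hz, bytes_per_sample, channels=1):
    """Size of uncompressed PCM audio in bytes."""
    return int(round(duration_s * sample_rate_hz * bytes_per_sample * channels))


def transfer_time_s(size_bytes, bus_width_bytes, bus_clock_hz):
    """Time to move size_bytes over the bus, one word per clock cycle."""
    transfers = -(-size_bytes // bus_width_bytes)  # ceiling division
    return transfers / bus_clock_hz


size = sound_size_bytes(DURATION_S, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)
transfer = transfer_time_s(size, BUS_WIDTH_BYTES, BUS_CLOCK_HZ)
headroom = PERCEPTION_THRESHOLD_S / transfer

print(f"sound size:    {size} bytes")
print(f"transfer time: {transfer:.8f} s")
print(f"headroom:      about {headroom:.0f}x faster than needed")
```

A real bus has arbitration and protocol overhead, so the true figure will be somewhat worse, but the margin is large enough that the conclusion does not change.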
So there is no problem with storing often-used bits of data in RAM. However, the CPU will be involved here, because it orchestrates what data goes where and when it goes there.
The software writers will devise a scheme that lets them respond to the keypress for firing. This starts a chain of events that results in the data being passed to the sound card. It is up to the software writers to manage this process so that the sound happens at the appropriate moment without affecting other elements of the game, like the screen. This has become easier over the history of computing because a lot of the sound and screen processing is now delegated to the sound and video cards.
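As a rough illustration of that chain of events, here is a minimal sketch of one way it could be organised. It is not how any particular engine does it. Every name in it (SoundCache, FakeSoundCard, handle_events, the "fire" event) is hypothetical, and the sound card is replaced by a stand-in object that only records what it was asked to play.

```python
# Minimal sketch: sounds are loaded into RAM once, and a key event
# triggers a hand-off of the cached data to the "sound card".
# All class and function names here are hypothetical.


class FakeSoundCard:
    """Stand-in for a real audio device; it only records submissions."""

    def __init__(self):
        self.submitted = []

    def play(self, pcm_bytes):
        # A real device would queue this buffer for playback.
        self.submitted.append(pcm_bytes)


class SoundCache:
    """Keeps often-used sounds in RAM so nothing is read from disk mid-game."""

    def __init__(self):
        self._sounds = {}

    def preload(self, name, pcm_bytes):
        self._sounds[name] = pcm_bytes

    def get(self, name):
        return self._sounds[name]


def handle_events(events, cache, card):
    """The CPU's part: decide which data goes to the card, and when."""
    for event in events:
        if event == "fire":
            card.play(cache.get("gunshot"))
        # Other events (movement, drawing, ...) would be handled here.


cache = SoundCache()
cache.preload("gunshot", bytes(44100))  # 0.5 s of silent 16-bit mono audio

card = FakeSoundCard()
handle_events(["move", "fire", "fire"], cache, card)
print(f"{len(card.submitted)} sounds handed to the card")
```

The point of the split is that the expensive work (loading the sound) happens once, up front, so the work done when the key is pressed is just a lookup and a hand-off.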