Skip to main content


Beiträge die mit rust getaggt sind


On Learning Rust and Go: Migrating Away from Python

I first started using Python in 1993. It's been my main programming language since about 2000. I've written a ton of code in Python, both for work and in my free time. For several years now, I've been…
Article word count: 621

HN Discussion:
Posted by edward (karma: 22125)
Post stats: Points: 119 - Comments: 90 - 2019-03-24T08:44:34Z

\#HackerNews #and #away #from #learning #migrating #python #rust
Article content:

I first started using Python in 1993. Itʼs been my main programming language since about 2000. Iʼve written a ton of code in Python, both for work and in my free time. For several years now, Iʼve been growing dissatisfied with it. Partly itʼs because Iʼd like more help from my programming tools, such as static type checking, better handling of abstractions and code modules, and in general aiding me in writing larger, more complex software. Partly itʼs because Iʼm writing more challenging software, and trying to get more out of the hardware I have available. Partly itʼs because Iʼm not getting the feeling that the Python community is going in a direction I want to follow. Instead I get the feeling that the Python community is happy to cut corners and compromise on things that Iʼm not willing to. Which is fine, if it makes their lives better, but leaves me wanting something else.

I wrote Obnam, my backup application, in Python, over a period of about fourteen years, until I retired it a year ago. During Obnamʼs life, Python 3 happened. Python 3 is actually a good thing, I think, but the transition was painful, and Obnam never made it. Obnam had other issues, which made it less fun to work on; Python 3 wasnʼt what killed it.

Obnam and other large programs Iʼve written in Python gave me the strong feeling that itʼs a nice language up to a certain size and complexity of program. Obnam is about 15000 lines of Python code. That turned out to be too much for me in Python: too often there were bugs that a static, strong type system would have caught. I could perhaps have been more diligent in how I used Python, and more careful in how I structured my code, but thatʼs my point: a language like Python requires so much self-discipline that at some point it gets too much.

So over the last few months Iʼve been learning Rust and Go, on and off, in short gaps of free time between other duties. Both have static type systems that can be argued to be strong. Both seem to have decent module systems. Both seem to support concurrency well. Either should be a good replacement for Python for non-small software I write. But I expect to be using Rust for any non-work programming and Go only when work needs me to.

Rust is developed by a community, and was started by Mozilla. Go development seems to be de facto controlled by Google, who originated the language. Iʼd rather bet my non-work future on a language that isnʼt controlled by a huge corporation, especially one of the main players in todayʼs surveillance economy. I write code in my free time because itʼs fun, and I release it as free software because thatʼs the ethical thing to do. I feel quite strongly that software freedom is one of the corner stones for the long-term happiness for humanity.


Ignoring ethical concerns, Rust seems like the better language of the two, so far. It has the better type system, the better compiler, the better tooling in general, and seems to me to have learnt better the lessons of programming languages and tools of the past third of a century. Rust exhibits better taste: things are designed the way they are for good reasons. Itʼs not always as convenient or familiar as Go, but it doesnʼt seem to make compromises for short-term convenience the way Go does.

Note that Iʼve not written any significant code in either language, so Iʼm just writing based on what Iʼve learnt by reading. My opinions may change in the future, as I get more into both languages.

HackerNewsBot debug: Calculated post rank: 109 - Loop: 78 - Rank min: 100 - Author rank: 74

Handmade Rust Part 1: Introduction & Allocators by Steven Le Rouzic: #Rust #language a powerful schema migration builder's 0.5.0 release by Squirrel People: #Rust #crates

The Embedded Working Group Newsletter - 17 by The Embedded Working Group: #Rust #embedded

The last two months in rustsim #4 (January - February 2019) by Sébastien Crozet: #Rust #crates

Learning Rust With Entirely Too Many Linked Lists by Alexis Beingessner: #Rust #language

New cargo subcommand: sync-readme by Dimitri Sabadie: #Rust #tools

Building an Embedded Futures Executor II by Josh Robson Chase: #Rust #embedded

Rust: The Hard Parts - Part One: References and borrowing by Naftuli Kay: #Rust #starting

Building and augmenting libraries by calling Rust from JavaScript by Ryan Levick: #Rust #web
#Rust #web

Implementing a Hidden Markov Model in Rust by Paul Kernfeld: #Rust #crates

Refactoring Varisat: 3. Conflict Driven Clause Learning by Jannis Harder: #Rust #compsci


Implementing a NES Emulator in Rust

Recently, I made an emulator for the Nintendo Entertainment Console(NES) - a game console first released in 1983. In this article, I’ll talk about how I used Rust to develop the emulator. I’ll cover…
Article word count: 3951

HN Discussion:
Posted by MichaelBurge (karma: 2247)
Post stats: Points: 154 - Comments: 41 - 2019-03-19T13:18:55Z

\#HackerNews #emulator #implementing #nes #rust
Article content:

Recently, I made an [1]emulator for the Nintendo Entertainment Console(NES) - a game console first released in 1983.

In this article, I’ll talk about how I used [2]Rust to develop the emulator. I’ll cover questions like:
\* What features does the emulator support? What games can it play? 
 \* How did I approach the problem of emulating the NES? 
 \* Did Rust’s type system or borrow checker interfere? Were there performance issues?

Table of Contents:


Super Mario Bros is beatable on my emulator:


\* Runs at a stable 60 FPS(can go up to 430 FPS in “headless” mode) 
 \* Use an Xbox 360 controller for game input 
 \* Savestates allow you to save at any time. Impress your friends with flawless Goomba-stomping 
 \* Video recording works with savestates to remember a single chain of controller inputs.

Games tested:
\* Donkey Kong 
 \* Super Mario Bros

There are a few remaining tasks that would make it more usable:
\* Support for more [4]mappers will allow it to play more games 
 \* The keyboard should be usable as an input device 
 \* The audio sounds different, though I don’t know the words to describe the problem

Rust seems like it was a fine choice for this project. Using features like iterators and traits generally didn’t slow down the program(though see below for one exception). Since an NES is a fixed hardware device, no dynamic allocation^[5]1 should be necessary and Rust makes it easy to reason about already-allocated memory with its ownership model.

People frequently want to run emulators on [6]strange embedded systems. C is probably still a better choice for those cases: It’s more difficult to find someone who can implement a Rust compiler than a C compiler.

But Rust seems usable in any project where C++ is a viable language.


I emulated the NES mostly by emulating each individual component separately. These include:

These components are either Clocked or are mapped to one of two Address Spaces. I define each component as a [7]C Struct, and implement one of two [8]Traits to specify how it interacts with other components.

For video/audio/controller IO with the host OS, I used the [9]SDL library.

Here is a table with all structures in my program. They generally correspond to actual hardware components like RAM or internal timers.
Structure Name      Component Clocked? Address Space?                                              Description

[10]Nes T F Top-level hardware component. Has all others as members
[11]C6502 CPU T T The CPU
[12]Ppu PPU T T Produces a 256x240 pixel display
[13]PpuRegisters PPU F F Hides some tricky internal PPU state
[14]PaletteControl PPU F T Stores which 13 colors have been chosen of 64 possible
[15]CpuPpuInterconnect PPU F T Maps certain PPU registers to CPU address space
[16]Sprite PPU F F Represents a 4-byte entry in the OAM table
[17]Apu APU T T Generates audio samples
[18]Frame Counter APU T F Generates a clock signal every quarter or half frame that other audio components react to.
[19]Length Counter APU T F Silences an audio channel after a certain number of clock cycles.
[20]Linear Counter APU T F Silences an audio channel according to a timer. Has slightly different timing than the length counter
[21]Triangle APU T F Audio channel for a triangle wave
[22]Sweep APU T F Dynamically changes the pitch of an audio channel
[23]Envelope APU T F Dynamically changes the volume of an audio channel
[24]Pulse APU T F Audio channel for a pulse/square wave
[25]Noise APU T F Audio channel for pseudo-random noise
[26]Dmc APU T F Audio channel for premade audio samples
[27]Ram F T A fixed-size block of readable and writable memory
[28]Rom F T A fixed-size block of read-only memory
[29]MirroredAddressSpace F T Makes another AddressSpace appear in multiple regions. See [30]Memory Mirroring
[31]NullAddressSpace F T Represents unmapped address space. Returns 0 when read, and does nothing on write.
[32]Mapper F T Splits the address space into regions, which are assigned to other AddressSpace types
[33]Joystick Input F T Communicates controller inputs to the CPU
[34]Ines Cartridge F F A game cartridge, represented in [35]iNES format

In this section, I’ll talk about the most important of these, and each of the traits that link them together.


pub trait Clocked { fn clock(&mut self);

A clock cycle is the smallest discrete step that a component can make: There should be no observable changes outside of a clock cycle.

The CPU is an example of a clocked component. A single CPU instruction might:
\* Request an 8-bit value at an address 
 \* Attempt to store an 8-bit value at an address 
 \* Calculate an addition using the Arithmetic Logic Unit(ALU)

Generally, an emulation that does more work at once is faster than a component that simulates each individual clock cycle.

A clock-accurate emulation of a clocked component is the most accurate^[36]2, but programmers - even rugged ones in the 1980s - generally don’t depend on the clock-by-clock details of the CPU. It’s common to assume that a bitwise AND instruction takes 6 clock cycles, but not that it does a memory read on clock #2 and a memory write on clock #6. So it can be an acceptable loss of accuracy to run the entire CPU instruction in one clock cycle, and then do nothing for the next 5.

The NES has a Master Clock, and all other Clocked components run at an integer fraction of its speed:

Component Clock Speed
Master 1:1 = 21.477272 Mhz
CPU 1:12
PPU 1:4
APU 1:24

The APU has components that act on two separate clock signals: One from the APU clock, and one from an internal FrameCounter that sends a signal every half or quarter frame.

Address Space

When a component wants to read or write a value at an address, it places the address on an Address Bus. There is a separate Data Bus that holds the value being read or written. Other components listen for specific addresses that they have been assigned.

An emulated Address Space is an assignment of addresses to actions that components take.

pub trait AddressSpace { fn peek(&mut self, ptr: u16) -> u8; fn poke(&mut self, ptr: u16, value: u8);

The NES has two different address spaces, primarily used by the CPU and PPU respectively:
\* [37]CPU Memory Map 
 \* [38]PPU Memory Map

The CPU mostly handles game logic, while the PPU’s address space stores sprites, tiles, and colors.

Cartridges listen on both address buses. They are not a simple ROM holding a list of bytes, since they can have arbitrary circuitry^[39]3. Games can choose to include extra RAM, coprocessors for specialized calculations, or control registers for changing the address space.

Super Mario Bros 3 animates its background by telling the cartridge to suddenly switch between two different background tile patterns. When the PPU reads from the relevant addresses, the cartridge suddenly starts returning different tile data. Its otherwise not possible to update the entire background in a single frame.

[40]Bank Switching

(Original image from [41]this article)


The CPU handles the game logic: What happens when Mario jumps, stomps on a Goomba, or falls in a pit?

It repeatedly fetches, decodes, and executes an instruction at the current Program Counter - a pointer in CPU address space - which is then incremented to the next instruction.

The CPU can only perform three primitive actions:
\* Request a read(in CPU address space) 
 \* Request a write 
 \* Change its internal registers

All instructions are combinations of these. The [42]Arithmetic Shift Left, with Absolute,X addressing mode instruction(0x1E) takes 7 clock cycles and causes the CPU to perform the following 7 actions^[43]4:
\* (1) Fetch the opcode byte 0x1E at the current program counter 
 \* (2,3) Fetch a two-byte value V immediately after the opcode byte containing a 16-bit address 
 \* (4) Adds the X register to V 
 \* (5) Fetch a one-byte value a located at address X+V. 
 \* (6) Calculate the value b = a << 1, updating several status flags 
 \* (7) Write the value b to address X+V.

It’s a complex operation, but the individual steps are simple: 4 memory fetches, 2 calculations using the ALU, and 1 memory write.

I implemented each instruction in terms of the three primitive operations, generated an instruction-by-instruction log using the [44]nestest test ROM, and verified that it matched the reference log.


At 60 Frames per Second(FPS) the NES’ CPU can execute 29780 clock cycles per frame^[45]5, but there are 61440 different pixels to display. So the CPU is too slow to draw even a blank screen.

A Picture Processing Unit(PPU) runs at triple the CPU’s clock speed, emitting one pixel per cycle. It has more cycles available than pixels to emit, so the idle time(“Vertical Blank”) is used by the CPU to configure the PPU for the next frame.

The PPU draws two things: Background tiles and sprites. Up to 32 x 30 tiles and 64 sprites can be displayed at once, and these share a pool of 512 different 8x8 4-color patterns.

There are 4 tables in PPU address space that configure these:
\* Nametable: A table of 32x30 bytes that specify which 8x8 pattern to use 
 \* Attribute Table: Specifies which 4-color palette is used for each 16x16 group of tiles 
 \* Object Attribute Memory(OAM): Stores the position, palette, and status of 64 sprites 
 \* Palette: There are 8 different 4-color palettes. The first color is always transparent, and the other 3 choose from 64 different System Colors.

The PPU and CPU have different address spaces, but they are not isolated. There are three communication channels between them:
\* There are 8 PPU registers mapped in the CPU’s address space. 
 \* The PPU triggers two CPU interrupts: End-of-scanline and Vertical Blank. 
 \* The cartridge itself can choose to modify its assigned PPU address space based on reads or writes to the CPU address space.

There are 5 primitive actions that the PPU can take on each clock cycle:
\* Request a read( in PPU address space) 
 \* Request a write 
 \* Change its internal registers 
 \* Emit a single pixel 
 \* Trigger a CPU interrupt

It is more difficult to test the PPU than the CPU: Much of the PPU must already be functioning in order to get any output at all. I recommend printing out the 4 tables separately - and once that data is confirmed correct, then test whether these are combined correctly for any particular pixel.

Save States

A common feature in emulators is to save and load a game at any time. Every component implements the Savable trait:

pub trait Savable { fn save(&self, fh: &mut Write); fn load(&mut self, fh: &mut Read);

There are two properties that make this model useful:
\* Once a ROM has been loaded, my emulator doesn’t allocate memory. So every object can be written in-place. 
 \* Loading a ROM always creates the same types of components, so the values saved from a previous program execution can be assumed compatible.

The first property guarantees that pointers aren’t affected by a savestate restore. If a component is dynamically-allocated and a savestate from before or after its lifetime is loaded, then any pointers to it are no longer valid. This would require keeping track of which objects have pointers to which other objects.

If this second property failed(say, if someone unplugged a standard NES controller and plugged in a new SNES controller), then the savestate files would need to include type information so the correct load method is called.

However, assuming these restrictions makes serialization straightforward. For example:

impl Savable for C6502 { fn save(&self, fh: &mut Write) {;;;;;;;;;;;;;;;; } fn load(&mut self, fh: &mut Read) { self.acc.load(fh); self.x.load(fh); self.y.load(fh); self.pc.load(fh); self.sp.load(fh); self.carry.load(fh);; self.interruptd.load(fh); self.decimal.load(fh); self.overflow.load(fh); self.negative.load(fh); self.mapper.load(fh); self.counter.load(fh); self.clocks.load(fh); self.is_tracing.load(fh); self.clocks_to_pause.load(fh); }

Every component except for the primitives like bool or u32 look like that. The primitives aren’t too difficult either:

impl Savable for u32 { fn save(&self, fh: &mut Write) { let bytes = [ ((self >> 0 ) & 0xff) as u8, ((self >> 8 ) & 0xff) as u8, ((self >> 16) & 0xff) as u8, ((self >> 24) & 0xff) as u8, ]; fh.write_all(&bytes); } fn load(&mut self, fh: &mut Read) { let mut bytes = [0u8; 4]; fh.read_exact(&mut bytes); *self = 0; *self |= (bytes [0]as u32) << 0; *self |= (bytes [1]as u32) << 8; *self |= (bytes [2]as u32) << 16; *self |= (bytes [3]as u32) << 24; }

Rust’s traits were useful for implementing savestates. The compiler infers which save or load methods are needed for each type, so the code is uniform.

The video playback feature builds on top of savestates by storing 8 bits per frame - one for whether each controller button was pressed. When a savestate is restored, it also restores the active input list.

Rust Language

The previous section gave a high-level overview of how I designed my NES emulator. In this section, I’ll talk about the Rust language itself.

Integer overflow

By default, Rust will throw an exception if any arithmetic operation overflows. This caught a fair number of bugs during testing, because the [46]wrapping_* functions require explicit type information and it’s important whether a number wraps as a 16-bit value or an 8-bit value.

I used functions like wrapping_add in my emulator, but there are also [47]Wrapping types that imply wrapping arithmetic operations.


In Rust, mutable values must have a single owning variable. Other variables can borrow it, but only one mutable reference is allowed at a single time.

The CPU address space has several PPU registers mapped. So the CPU maintains a permanent mutable reference to the PPU. But the top-level Nes object also owns the PPU.

I worked around this using [48]Box to assign fixed memory addresses to values, and then “unsafe” pointer dereferences when needed.

pub struct Nes { pub cpu: Box, pub apu: Box, pub ppu: Box,
} pub struct CpuPpuInterconnect { ppu: *mut Ppu, cpu: *mut C6502,
} impl AddressSpace for CpuPpuInterconnect { fn peek(&mut self, ptr:u16) -> u8 { let ppu:&mut Ppu = unsafe { &mut *self.ppu }; // Games arenʼt supposed to read some of these, but // if they do, the "open bus" is whatever value was last written // to any PPU register. match map_ppu_port(ptr) { Some(PPUCTRL) => ppu.open_bus, Some(PPUMASK) => ppu.open_bus, Some(PPUSTATUS) => ppu.read_status(), Some(OAMADDR) => ppu.open_bus, Some(OAMDATA) => ppu.read_oam_data(), Some(PPUSCROLL) => ppu.open_bus, Some(PPUADDR) => ppu.open_bus, Some(PPUDATA) => ppu.read_data(), Some(OAMDMA) => ppu.open_bus, port => panic!("INVALID PPU PORT READ {:?} {:x}", port, ptr), } } fn poke(&mut self, ptr:u16, value:u8) { let ppu:&mut Ppu = unsafe { &mut *self.ppu }; ppu.open_bus = value; match map_ppu_port(ptr) { Some(PPUCTRL) => ppu.write_control(value), Some(PPUMASK) => ppu.write_mask(value), Some(PPUSTATUS) => {}, Some(OAMADDR) => ppu.write_oam_address(value), Some(OAMDATA) => ppu.write_oam_data(value), Some(PPUSCROLL) => ppu.write_scroll(value), Some(PPUADDR) => ppu.write_address(value), Some(PPUDATA) => ppu.write_data(value), Some(OAMDMA) => { let cpu = unsafe { &mut *self.cpu }; let ptr_base = (value as u16) << 8; for i in 0..=255 { let addr = ptr_base + i; let v = cpu.peek(addr); ppu.oam[ppu.oam_ptr as usize] = v; ppu.oam_ptr = ppu.oam_ptr.wrapping_add(1); } } port => panic!("INVALID PPU PORT WRITE {:?} {:x} {:x}", port, ptr, value), } }

This use of unsafe means that my emulator is not thread-safe. If both the PPU and CPU ran in separate threads, they could both issue writes to the same address or read a value while it is being written.

A [49]Mutex might be a solution here. The lock would be held for a small amount of time, so it shouldn’t cause much of a performance issue.


In a later article, I plan to train a Reinforcement Learning(RL) agent to play Mario. A faster emulator allows me to gather more sample data, so I spent a few hours ensuring my emulator was comparable to other NES emulators used for RL research.

My benchmark runs 10,000 frames of Mario in a “headless” mode with no video or sound, and then writes a screenshot.

Before I started optimizing it, my emulator got about 350 frames-per-second(FPS). I saw RL people getting between 200 and 450 FPS on various emulators, so I made a few small changes and ended up at about 430 FPS.

I found three optimizations, which each gave a 10% FPS increase.

Compilers try to avoid emitting division instructions, since they are slow. However, division by a non-constant defeats many of the optimizations it uses.

The code

impl Clocked for FrameCounter { fn clock(&mut self) { self.step += 1; let cap = if self.mode { 18641 } else { 14915 }; self.step %= cap; }

compiles to

add $0x1, %eax ; self.step += 1
cmpb $0x0, 0x33(%rbx) ; self.mode
mov $0x3a43,%ecx ; 14915
mov $0x46d1,%edi ; 18641
cmove %ecx,%edi ; let cap = if self.mode ...
xor %edx,%edx
div %di ; self.step %= cap

. Since the FrameCounter is clocked every other CPU cycle, it executes about 14914 div instructions per NES frame, or about 6.4 million per second at the accelerated rate of 430 FPS.

Given the following two conditions:
\* The dividend never decreases 
 \* The dividend never increases by more than the divisor between divisions

The above code can be replaced with a conditional subtraction:

impl Clocked for FrameCounter { fn clock(&mut self) { self.step += 1; let cap = if self.mode { 18641 } else { 14915 }; if self.step >= cap { self.step -= cap; } }

The branch is only taken once every 15,000 clocks, so this bottleneck is completely removed and the emulator as a whole processes 10% more frames.

Consider the following two implementations of the same function:

#! [no_std]#
[no_mangle]pub fn foo(xs: &[usize]) -> usize { let mut acc:usize = 0; for (x, idx) in xs.iter().zip(0..xs.len()) { acc = acc.wrapping_add(*x); acc = acc.wrapping_mul(idx); } return acc;
} #
[no_mangle]pub fn bar(xs: &[usize]) -> usize { let mut acc:usize = 0; for idx in 0..xs.len() { let x = xs[idx]; acc = acc.wrapping_add(x); acc = acc.wrapping_mul(idx); } return acc;

The function foo is a fairly direct translation of the original source:

0000000000000000 : 0: 48 85 f6 test rsi,rsi 3: 74 33 je 38 <foo+0x38> 5: 48 8d 0c f5 00 00 00 lea rcx,[rsi8+0x0] c: 00 d: 31 c0 xor eax,eax f: 31 d2 xor edx,edx 11: 66 2e 0f 1f 84 00 00 nop WORD PTR cs:[rax+rax1+0x0] 18: 00 00 00 1b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax1+0x0] 20: 48 39 f2 cmp rdx,rsi 23: 73 12 jae 37 <foo+0x37> 25: 48 03 04 d7 add rax,QWORD PTR [rdi+rdx8] 29: 48 0f af c2 imul rax,rdx 2d: 48 8d 52 01 lea rdx,[rdx+0x1] 31: 48 83 c1 f8 add rcx,0xfffffffffffffff8 35: 75 e9 jne 20 <foo+0x20> 37: c3 ret 38: 31 c0 xor eax,eax 3a: c3 ret

The variables are mapped as follows:
\* rcx is the remaining number of bytes in the slice xs 
 \* rdx is the variable idx 
 \* rax is the variable acc 
 \* rdi points to the beginning of the data in xs

And the loop executes add and imul instructions until rcx is 0.

However, the function bar unrolls its loop 4 times:

0000000000000000 : 0: 48 85 f6 test rsi,rsi 3: 74 1c je 21 <bar+0x21> 5: 48 8d 46 ff lea rax,[rsi-0x1] 9: 41 89 f0 mov r8d,esi c: 41 83 e0 03 and r8d,0x3 10: 48 83 f8 03 cmp rax,0x3 14: 73 0e jae 24 <bar+0x24> 16: 31 c0 xor eax,eax 18: 31 d2 xor edx,edx 1a: 4d 85 c0 test r8,r8 1d: 75 4e jne 6d <bar+0x6d> 1f: eb 60 jmp 81 <bar+0x81> 21: 31 c0 xor eax,eax 23: c3 ret 24: 4c 29 c6 sub rsi,r8 27: 31 c0 xor eax,eax 29: 31 d2 xor edx,edx 2b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax1+0x0] 30: 48 03 04 d7 add rax,QWORD PTR [rdi+rdx8] 34: 48 0f af c2 imul rax,rdx 38: 48 03 44 d7 08 add rax,QWORD PTR [rdi+rdx8+0x8] 3d: 48 8d 4a 01 lea rcx,[rdx+0x1] 41: 48 0f af c1 imul rax,rcx 45: 48 03 44 d7 10 add rax,QWORD PTR [rdi+rdx8+0x10] 4a: 48 8d 4a 02 lea rcx,[rdx+0x2] 4e: 48 0f af c1 imul rax,rcx 52: 48 03 44 d7 18 add rax,QWORD PTR [rdi+rdx8+0x18] 57: 48 8d 4a 03 lea rcx,[rdx+0x3] 5b: 48 8d 52 04 lea rdx,[rdx+0x4] 5f: 48 0f af c1 imul rax,rcx 63: 48 39 d6 cmp rsi,rdx 66: 75 c8 jne 30 <bar+0x30> 68: 4d 85 c0 test r8,r8 6b: 74 14 je 81 <bar+0x81> 6d: 49 f7 d8 neg r8 70: 48 03 04 d7 add rax,QWORD PTR [rdi+rdx8] 74: 48 0f af c2 imul rax,rdx 78: 48 8d 52 01 lea rdx,[rdx+0x1] 7c: 49 ff c0 inc r8 7f: 75 ef jne 70 <bar+0x70> 81: c3 ret

There are two loops:
\* The loop beginning at 30: processes 4 elements from xs 
 \* The loop beginning at 70: processes one element from xs

The variables are assigned as follows:
\* r8d contains the remaining number of elements to process after the 30: loop has finished. 
 \* rax is the acc variable 
 \* rdi points to the beginning of the data for xs 
 \* rdx is the idx variable, updated once per loop iteration 
 \* rcx is also the idx variable, but updated more frequently in the unrolled 30: loop. 
 \* rsi points to the first data that the 30: loop is unable to process.

The main difference seems to be that foo counts bytes while bar counts elements. Perhaps the tuple (x, idx) has size 16 bytes which is larger than the maximum scale of 8 in addressing modes like QWORD PTR [rdi+rdx*8+C]. rustc might choose its loop strategy before later optimizations make them equivalent. The difference goes away if foo and bar are changed to take u8s rather than usizes.

In any case, changing one of my loops to use bar’s style improved the emulator’s FPS by 10%.
Search order

The CPU Address Space consists of many components which each listen for reads and writes to specific addresses.

The Mapper takes an address and does a linear search through all of them looking for the responsible component.

When I first wrote the code, I mapped the address space from first-to-last:

fn map_nes_cpu(&mut self, joystick1: Box, joystick2: Box, cartridge: Box) { let mut mapper:Mapper = Mapper::new(); let cpu_ram:Ram = Ram::new(0x800); let cpu_ppu:CpuPpuInterconnect = CpuPpuInterconnect::new(self.ppu.deref_mut(), self.cpu.deref_mut()); let apu = self.apu.deref_mut() as *mut Apu; // mapper.map_mirrored(0x0000, 0x07ff, 0x0000, 0x1fff, Box::new(cpu_ram), false); mapper.map_mirrored(0x2000, 0x2007, 0x2000, 0x3fff, Box::new(cpu_ppu), true); mapper.map_address_space(0x4000, 0x4013, Box::new(apu), true);; mapper.map_address_space(0x4014, 0x4014, Box::new(cpu_ppu), true); mapper.map_address_space(0x4015, 0x4015, Box::new(apu), true); mapper.map_address_space(0x4016, 0x4016, joystick1, false); mapper.map_address_space(0x4017, 0x4017, Box::new(apu), true); // TODO - 0x4017 is also mapped to joystick2 mapper.map_address_space(0x4017, 0x4017, joystick2, false); // TODO - joystick2 isnʼt used, but this transfers ownership so it isnʼt deallocated(since it is updated through pointers) mapper.map_null(0x4018, 0x401F); // APU test mode mapper.map_address_space(0x4020, 0xFFFF, cartridge, true); self.cpu.mapper = Box::new(mapper); self.cpu.initialize();

However, this puts the cartridge at the very end. The cartridge contains all of the game’s code, so every time the CPU fetched a new instruction it got a worst-case linear-search.

I experimented with an LRU cache, but found the best improvement by simply reordering those statements to put the most frequently-accessed components first.

This also gave a 10% improvement in FPS.


This article discussed an NES emulator I developed using the Rust programming language.

If you’re interested in learning more about NES development, I recommend the [50]Nesdev Wiki for technical details needed to make emulators or games.

Future articles on this subject may cover:
\* Training an agent to play Mario 
 \* Improving the performance of the emulator 
 \* Developing emulators for other systems 
 \* Compiling the emulator to Javascript to run in-browser 
 \* Compatibility issues with less-common games 
 \* How specific NES games worked internally 
 \* Porting the emulator to run on a GPU^[51]6 
 \* Writing an NES game in Rust 
 \* Making new optimization passes for rustc or LLVM 
 \* Accelerating an emulator with a JIT compiler


Visible links

HackerNewsBot debug: Calculated post rank: 116 - Loop: 219 - Rank min: 100 - Author rank: 17


Rust Cookbook

This Rust Cookbook is a collection of simple examples that demonstrate good practices to accomplish common programming tasks, using the crates of the Rust ecosystem. Read more about Rust Cookbook,…
Article word count: 114

HN Discussion:
Posted by ingve (karma: 100042)
Post stats: Points: 148 - Comments: 16 - 2019-03-17T17:18:02Z

\#HackerNews #cookbook #rust
Article content:


This Rust Cookbook is a collection of simple examples that demonstrate good practices to accomplish common programming tasks, using the crates of the Rust ecosystem.

[1]Read more about Rust Cookbook, including tips for how to read the book, how to use the examples, and notes on conventions.

This project is intended to be easy for new Rust programmers to contribute to, and an easy way to get involved with the Rust community. It needs and welcomes help. For details see [2]
Recipe                    Crates              Categories

[3]Parse command line arguments [4]clap-badge [5]cat-command-line-badge
[6]ANSI Terminal [7]ansi_term-badge [8]cat-command-line-badge
Recipe                      Crates                 Categories

[9]Check number of logical cpu cores [10]num_cpus-badge [11]cat-hardware-support-badge
Recipe                        Crates                            Categories

[12]Declare lazily evaluated constant [13]lazy_static-badge [14]cat-caching-badge [15]cat-rust-patterns-badge
Recipe                 Crates        Categories

[16]Listen on unused port TCP/IP [17]std-badge [18]cat-net-badge


Visible links

HackerNewsBot debug: Calculated post rank: 104 - Loop: 98 - Rank min: 100 - Author rank: 128

Up and Running with React + Rust + Wasm by Preston Richey: #Rust #crates

Oxide: The Essence of Rust by Aaron Weiss, Daniel Patterson, Nicholas D. Matsakis, Amal Ahmed: #Rust #compsci

Pulldown_cmark 0.3 release announcement by Marcus Klaas de Vries: #Rust #crates

Refactoring Varisat: 2. Clause Storage and Unit Propagation by Jannis Harder: #Rust #compsci