Computer System

Posted on 2018-10-15

L2 Naming

  1. Modularity limits complexity. A good example of modular design is the client/server design.
  2. DNS maps hostnames to IP addresses. It is also a good example of hierarchy.

Example: DNS server

  • DNS hierarchy
    root → com/edu/net/org/gov
    com → apple/google
    google → mail/www

  • DNS look-up algorithms
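
A minimal Go sketch of one such look-up, walking the hierarchy above iteratively (the zone data and helper names are hypothetical; real resolvers follow referrals between name servers and cache results):

    package main

    import (
        "fmt"
        "strings"
    )

    // zones maps each zone to the names it delegates to, mimicking
    // the root -> com -> google -> www hierarchy sketched above.
    var zones = map[string][]string{
        ".":          {"com", "edu", "net", "org", "gov"},
        "com":        {"apple", "google"},
        "google.com": {"mail", "www"},
    }

    // resolve walks the hierarchy label by label, the way an iterative
    // resolver queries root, then TLD, then authoritative servers.
    func resolve(name string) bool {
        labels := strings.Split(name, ".") // "www.google.com" -> [www google com]
        zone := "."
        for i := len(labels) - 1; i >= 0; i-- {
            found := false
            for _, child := range zones[zone] {
                if child == labels[i] {
                    found = true
                    break
                }
            }
            if !found {
                return false
            }
            if zone == "." {
                zone = labels[i]
            } else {
                zone = labels[i] + "." + zone
            }
        }
        return true
    }

    func main() {
        fmt.Println(resolve("www.google.com")) // true
        fmt.Println(resolve("ftp.google.com")) // false
    }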

L3 Memory Virtualization

  1. virtualization: modularity on a single machine
  • programs shouldn’t be able to refer to (and corrupt) each other’s memory
    virtualize memory
  • programs should be able to communicate
    virtualize communication links
  • programs should be able to share a CPU w/o one program halting the progress of the others
    virtualize processors
  2. Memory virtualization
    Each process gets a 4 GB (2^32-byte) virtual address space.
  • How to store the mapping: the MMU
    If the mapping for every virtual address (2^32 entries) were stored as a 32-bit address, it would take 16 GB to store the table. The solution is the MMU and its page tables.

Page tables: address translation, from virtual addresses to physical (real) addresses.

The table can be made much smaller because the low-order bits (the offset within a page) map through directly; only page numbers need translation.

With 4 KB (2^12-byte) pages, the table now has 2^(32-12) = 2^20 entries; at 32 bits per entry, that is 4 MB to store the table.

Each entry uses bits 31-12 (20 bits) to store the physical page number. Among the remaining 12 bits, one is the present (P) bit: is the page currently in DRAM? One is the read/write (R/W) bit: is the program allowed to write to this address? One is the user/supervisor (U/S) bit: does the program have access to this address?
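
A minimal Go sketch of this translation with a flat 2^20-entry table (the names and error handling are illustrative assumptions, not how a real MMU is programmed):

    package main

    import "fmt"

    const (
        pageBits = 12     // 4 KB pages: low 12 bits are the offset
        pBit     = 1 << 0 // present
        rwBit    = 1 << 1 // read/write
        usBit    = 1 << 2 // user/supervisor
    )

    // translate maps a virtual address to a physical address using a
    // flat page table: bits 31-12 of each entry hold the physical page
    // number, and the low bits hold the P/RW/US flags described above.
    func translate(pageTable []uint32, vaddr uint32) (uint32, error) {
        vpn := vaddr >> pageBits // virtual page number (top 20 bits)
        offset := vaddr & 0xFFF  // offset passes through unchanged
        entry := pageTable[vpn]
        if entry&pBit == 0 {
            return 0, fmt.Errorf("page fault at %#x", vaddr) // not in DRAM
        }
        ppn := entry >> pageBits // physical page number
        return ppn<<pageBits | offset, nil
    }

    func main() {
        table := make([]uint32, 1<<20) // 2^20 entries * 4 bytes = 4 MB
        // Map virtual page 2 to physical page 7, present and writable.
        table[2] = 7<<pageBits | pBit | rwBit
        paddr, err := translate(table, 0x2ABC) // vpn = 2, offset = 0xABC
        fmt.Printf("%#x %v\n", paddr, err)     // 0x7abc <nil>
    }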

Next question: how can the page table be shrunk further? Page the page table: with a multi-level table, only the parts of the table actually in use need to be resident in memory.

The kernel manages page faults and other interrupts, using them to load pages from disk or evict (replace) pages.

  3. Abstraction of devices
    via system calls, which are implemented with interrupts. Using interrupts means the kernel accesses the devices directly, not the user program.

L4 Bounded Buffers and Locks

The abstraction that lets programs communicate: bounded buffers.

  1. Bounded buffer: a buffer that stores (up to) N messages

It is implemented as a circular queue, which is also how Go channels are implemented under the hood.
bounded buffer API:

send(m)
m <- receive()

lock API:

acquire(l)
release(l)

send(bb, message):
    while True:
        if bb.in - bb.out < N:
            bb.buf[bb.in mod N] <- message
            bb.in <- bb.in + 1
            return

receive(bb):
    while True:
        if bb.out < bb.in:
            message <- bb.buf[bb.out mod N]
            bb.out <- bb.out + 1
            return message

locks: allow only one CPU to be inside a piece of code at a time

  1. Avoid race in bounded buffer

    send(bb, message):
        acquire(bb.lock)
        while bb.in - bb.out == N:
            release(bb.lock)
            acquire(bb.lock)
        bb.buf[bb.in mod N] <- message
        bb.in <- bb.in + 1
        release(bb.lock)
        return
  2. Implementing locks

acquire(lock):
    while lock != 0:
        do nothing  // spin lock
    lock = 1

release(lock):
    lock = 0

problem: race condition (need locks to implement locks!)

How a spin lock really works:

acquire(lock):
    do:
        r = 1
        XCHG r, lock  // atomic
    while r == 1

The XCHG instruction atomically exchanges the contents of two registers, or of a register and a memory location.

release(lock):
    lock = 0
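
For comparison, a minimal sketch of the same exchange-based lock in Go, with sync/atomic's SwapInt32 playing the role of XCHG (the SpinLock type is illustrative; real Go code would normally just use sync.Mutex):

    package main

    import (
        "fmt"
        "runtime"
        "sync"
        "sync/atomic"
    )

    // SpinLock mirrors the XCHG loop above: atomically write 1 and
    // inspect the old value to see whether the lock was already held.
    type SpinLock struct{ state int32 }

    func (l *SpinLock) Acquire() {
        for atomic.SwapInt32(&l.state, 1) == 1 {
            runtime.Gosched() // yield instead of burning the CPU
        }
    }

    func (l *SpinLock) Release() {
        atomic.StoreInt32(&l.state, 0)
    }

    func main() {
        var lock SpinLock
        var wg sync.WaitGroup
        counter := 0
        for i := 0; i < 4; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for j := 0; j < 1000; j++ {
                    lock.Acquire()
                    counter++ // the critical section
                    lock.Release()
                }
            }()
        }
        wg.Wait()
        fmt.Println(counter) // always 4000
    }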

L5 Threads

  1. Thread
    thread: a virtual processor
    API:
  • suspend(): save state of current thread to memory
  • resume(): restore state from memory
  2. Condition variables: similar to semaphores
    let threads wait for events, and get notified when they occur.
    API:
  • wait(cv): yield processor and wait to be notified of cv
  • notify(cv): notify waiting threads of cv
    e.g., applied in bounded buffers:
    send(bb, message):
        acquire(bb.lock)
        while bb.in - bb.out == N:
            release(bb.lock)
            wait(bb.not_full)  // hang
            acquire(bb.lock)
        bb.buf[bb.in mod N] <- message
        bb.in <- bb.in + 1
        release(bb.lock)
        notify(bb.not_empty)  // notify
        return

(threads in receive() will wait on bb.not_empty and notify of bb.not_full)

Problem: lost notify. A notify that fires between release(bb.lock) and wait(bb.not_full) is missed, so the waiter may sleep forever.

Improved API:
wait(cv,lock): yield processor, release lock, wait to be notified of cv
notify(cv): notify waiting threads of cv

send(bb, message):
    acquire(bb.lock)
    while bb.in - bb.out == N:
        wait(bb.not_full, bb.lock)  // release lock, wait to be notified
    bb.buf[bb.in mod N] <- message
    bb.in <- bb.in + 1
    release(bb.lock)
    notify(bb.not_empty)
    return

wait(cv, lock):
    acquire(t_lock)
    release(lock)
    id = cpus[CPU].thread
    threads[id].cv = cv
    threads[id].state = WAITING
    yield_wait()
    release(t_lock)
    acquire(lock)

notify(cv):
    acquire(t_lock)
    for id = 0 to N-1:
        if threads[id].cv == cv && threads[id].state == WAITING:
            threads[id].state = RUNNABLE
    release(t_lock)
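
Go's sync.Cond follows this improved contract: Wait atomically releases the lock and sleeps, reacquiring the lock before returning. A minimal bounded-buffer sketch on top of it (the type and method names are mine, chosen to mirror the pseudocode above):

    package main

    import (
        "fmt"
        "sync"
    )

    type BoundedBuffer struct {
        mu       sync.Mutex
        notFull  *sync.Cond
        notEmpty *sync.Cond
        buf      []string
        in, out  int
    }

    func NewBoundedBuffer(n int) *BoundedBuffer {
        bb := &BoundedBuffer{buf: make([]string, n)}
        bb.notFull = sync.NewCond(&bb.mu)
        bb.notEmpty = sync.NewCond(&bb.mu)
        return bb
    }

    func (bb *BoundedBuffer) Send(m string) {
        bb.mu.Lock()
        for bb.in-bb.out == len(bb.buf) { // buffer full
            bb.notFull.Wait() // releases mu, reacquires on wakeup
        }
        bb.buf[bb.in%len(bb.buf)] = m
        bb.in++
        bb.mu.Unlock()
        bb.notEmpty.Signal()
    }

    func (bb *BoundedBuffer) Receive() string {
        bb.mu.Lock()
        for bb.out == bb.in { // buffer empty
            bb.notEmpty.Wait()
        }
        m := bb.buf[bb.out%len(bb.buf)]
        bb.out++
        bb.mu.Unlock()
        bb.notFull.Signal()
        return m
    }

    func main() {
        bb := NewBoundedBuffer(2)
        go func() { bb.Send("hello"); bb.Send("world") }()
        fmt.Println(bb.Receive(), bb.Receive())
    }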

L6 OS and Virtual Machines

VM structure:
The VMM runs in kernel mode on the hardware, letting VMs safely share access to the physical hardware.
The VMM’s goal: virtualize the hardware.
Run the VMM on the hardware in kernel mode, with guest OSes in user mode.

  1. Roles of the VMM
  • Allocate resources
  • Dispatch events
  • Deal with instructions from the guest OS that require interaction with the physical hardware
    • Attempt 1: emulate every single instruction. Problem: slow.
    • Attempt 2: let guest OSes run instructions directly on the CPU. Problem: dealing with privileged instructions (the guest can’t run in kernel mode; then we’d be back to our original problem). The VMM will handle privileged instructions.
  2. VMM implementation
  • Trap and emulate
    • Guest OS runs in user mode
    • Privileged instructions cause an exception; the VMM intercepts these and emulates
    • If the VMM can’t emulate, it sends the exception back up to the guest OS
  • Problems:
    • How to do the emulation
    • How to deal with instructions that don’t trigger an interrupt but that the VMM still needs to intercept
  3. Virtualizing memory
  • The VMM needs to translate guest OS addresses into physical memory addresses. Three layers: guest virtual, guest physical, host physical.
  • Approach 1: shadow page tables
    • The guest OS loads the PTR (page-table register); this causes an interrupt, which the VMM intercepts
    • The VMM locates the guest OS’s page table and combines it with its own table, constructing a third table that maps guest virtual to host physical
    • The VMM loads the host physical address of this new page table into the hardware PTR
    • If the guest OS modifies its page table, no interrupt is thrown. To force an interrupt, the VMM marks the guest OS’s page table as read-only memory
  • Approach 2: hardware support
    • Modern hardware has support for virtualization
    • The physical hardware (effectively) knows about both levels of tables: it will do the lookup in the guest OS’s page table and then in the VMM’s page table
  4. Virtualizing the U/K bit (the user/kernel mode bit)
  • Problem with basic trap-and-emulate: the U/K bit is involved in some instructions that don’t cause exceptions (e.g., reading the U/K bit, or writing it to U)
  • A few solutions:
    • Para-virtualization: modify the guest OS. Hard to do, and goes against our compatibility goal
    • Binary translation: the VMM analyzes code from the guest OS and replaces problematic instructions
    • Hardware support: some architectures have virtualization support built in, with a special VMM operating mode in addition to the U/K bit
  • Hardware support is arguably the best: it makes the VMM’s job easier.
  5. Monolithic kernels
  • VMs protect OSes from each other’s faults, protect physical machine from OS faults. Why so many bugs, though?
  • The Linux kernel is, effectively, one large C program. Careful software engineering, but very little modularity within the kernel itself.
  • Bugs come about because of its complexity
  • Kernel bugs = entire system failure (recall the in-class demo)
  • Even worse: adversary can exploit these bugs
  6. Microkernels: an alternative to monolithic kernels
  • Put subsystems – file servers, device drivers, etc. – in user programs. More modular.
  • There will still be bugs but:
    • Fewer, because of decreased complexity
    • A single bug is less likely to crash the entire system
  • Why isn’t Linux a microkernel, then?
    • High communication cost between modules
    • Not clear that moving programs to user space is worth it
    • Hard to balance dependencies (e.g., sharing memory across
      modules)
    • Redesign is tough!
    • Spend a year of developer time rewriting the kernel or adding
      new features?
    • Microkernels can make it more difficult to change interfaces
  • Some parts of Linux do have microkernel design aspects

L7 Performance

Two metrics: latency and throughput.

  • Throughput: number of requests over a unit of time
  • Latency: amount of time for a single request
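
A toy calculation to make the difference concrete: if a server handles requests one at a time and each takes 10 ms, latency is 10 ms and throughput is 100 requests/s; running 4 such workers in parallel (assuming uniform, independent requests) keeps latency at 10 ms while raising throughput to 400 requests/s.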

L8 Networking

  1. Network layers

    7   [      Application     ] - super high level
    5/6 [ Session/Presentation ]
    4   [      Transport       ] - reliable (maybe) delivery (TCP/UDP)
    3   [     Network - IP     ] - addressing/routing
    2   [         Link         ] - point-to-point links
    1   [       Physical       ] - physical medium
  2. Problems we deal with today

  • DDoS: send a lot of traffic to one machine to consume its
    resources. Hard to prevent. Internet wasn’t designed with
    accountability in mind.
  • Security: Internet was not designed for security. DNS, BGP, etc.,
    have no secure infrastructure
  • Mobility: no one ever imagined this
  • Address Space Depletion: IPv4 -> IPv6
  • Congestion control: should probably change given the changes we’ve
    seen in the Internet (more about this later)

L11 Network: Reliable Transport and Congestion Control

  1. TCP
    Reliable Transport via sliding-window protocol
  • Goal: receiving application gets a complete, in-order bytestream from the sender. One copy of every packet, in order.
    • Why do we need it? Network is unreliable. Packets get dropped, can arrive out-of-order.
    • Basics:
      • Every data packet gets a sequence number (1, 2, 3, …)
      • Sender has W outstanding packets at any given time. W = window size
      • When the receiver gets a packet, it sends an ACK back. ACKs are cumulative: an ACK for X indicates “I have received all packets up to and including X.”
      • If sender doesn’t receive an ACK indicating that packet X has been received, after some amount of time it will “timeout” and retransmit X.
        • Maybe X was lost, its ACK was lost, or its ACK is delayed
        • The timeout = proportional to (but a bit larger than) the RTT of the path between sender and receiver
      • At receiver: keep buffer to avoid delivering out-of-order packets, keep track of last-packet-delivered to avoid delivering duplicates.
  2. Congestion Control
  • Goals: Efficiency and fairness

    • Minimize drops, minimize delay, maximize utilization
    • Share bandwidth fairly among all connections that are using it
      • FOR NOW: assume all senders have infinite offered load. Fairness = splitting bandwidth equally amongst them.
      • But no sender knows how many other senders there are, and that number can change over time.
      • We’ll use window-based congestion control. Switches are dumb (can only drop packets); senders are smart
  • Mechanisms

    • Slow Start
      • At the beginning of the connection, exponentially increase the window (double it every RTT until you see loss)
      • Decreases the time it takes for the initial window to “ramp up”
    • Fast Retransmit/Fast Recovery
      • When a sender receives an ACK with sequence number X, and then three duplicates of that ACK, it immediately retransmits packet X+1 (remember: ACKs are cumulative)
        • Example: the sender sends packets 1 2 3 4 5 6; the receiver ACKs 1 2 2 2 2. The sender receives 4 ACKs total with sequence number “2”, infers that packet 3 is lost, and immediately retransmits it.
        • On fast retransmit, the window decrease is as before: W = W/2
        • When a packet is lost due to timeout, TCP behaves differently: W = 1, then do slow start up to the last good window, then begin additive increase. (See slide for diagram; a sketch of these rules follows this list.)
        • Reasoning: a retransmission due to timeout means there is significant loss in the network, and senders should back way off.
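
A minimal Go sketch of these window-update rules (window counted in packets; the explicit ssthresh variable is an assumption borrowed from standard TCP terminology, standing in for “the last good window”):

    package main

    import "fmt"

    // Sender tracks the congestion window W and the slow-start
    // threshold below which the window doubles every RTT.
    type Sender struct {
        w        float64 // congestion window, in packets
        ssthresh float64
    }

    // OnRTT models one loss-free round trip.
    func (s *Sender) OnRTT() {
        if s.w < s.ssthresh {
            s.w *= 2 // slow start: exponential increase
        } else {
            s.w++ // additive increase
        }
    }

    // OnFastRetransmit models a loss detected via duplicate ACKs.
    func (s *Sender) OnFastRetransmit() {
        s.w /= 2 // multiplicative decrease, as before
        s.ssthresh = s.w
    }

    // OnTimeout models heavy loss: back way off and slow-start again.
    func (s *Sender) OnTimeout() {
        s.ssthresh = s.w / 2
        s.w = 1
    }

    func main() {
        s := Sender{w: 1, ssthresh: 32}
        for i := 0; i < 7; i++ {
            s.OnRTT()
        }
        fmt.Println(s.w) // 1 -> 2 -> 4 -> 8 -> 16 -> 32 -> 33 -> 34
    }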

L13 Networking: P2P Networks + CDNs

Solutions for file-sharing, VoIP, and video-streaming:
P2P networks, or related constructs (CDNs).

  1. File-sharing: getting a file from one person (machine) to another
    Techniques

    • Can use client/server
      • client requests file, server responds with the data
      • HTTP, FTP work this way
      • Downsides: single point of failure, expensive, doesn’t scale
    • Could use CDNs
      • Buy multiple servers, put them near clients to decrease latency
      • No single point of failure, scales better
      • See the next recitation for more discussion
  2. Peer-to-peer (P2P) networks for file-sharing

    • Distribute the architecture to the extreme
    • Once a client downloads (part of) the file from the server, that
      client can upload (part of) the file to others. Put clients to
      work!
    • In theory: infinitely scalable
    • P2P networks create overlays on top of the underlying Internet (so do CDNs)
    • Problem: what if users aren’t willing to upload?
  3. BitTorrent: how to incentivize peers to upload

    • Basics of original BitTorrent (BT) protocol:
      • Create a .torrent file, which contains meta-information about
        the file (file name, length, info about pieces that comprise the
        file, URL of tracker)
      • Have a tracker. A server that knows the identity of all the
        peers involved in your file transfer.
      • To download:
        • Peer contacts tracker
        • Tracker responds with list of other peers involved in transfer
        • Peer connects to these other peers, begins to transfer blocks
          (see below)
        • Some peers are seeders: already have the entire file (maybe
          servers that host the file, or just nice peers who are
          sticking around)
    • In the actual download, peers request blocks: pieces of pieces
      • Details/terminology doesn’t matter. Just know that blocks are small (~16KB) chunks of the file.
      • Request blocks in a random order (more or less)
      • What incentivizes users to upload (UL) rather than just
        download (DL)?
        • High-level: users aren’t allowed to DL from a user unless
          they’re also ULing to that user
          • So peers want mutual interest: A has to have blocks that B
            needs, and vice versa.
        • The protocol is divided into rounds. In round n, some number of
          peers upload blocks to Peer X. In round n+1, Peer X will send
          blocks to the peers that uploaded the most in round n.
          (Typically, to the top four peers; a sketch of this selection
          appears at the end of this section.)
        • How do peers get started? Each peer reserves some (small)
          amount of bandwidth to give away freely
      • This method of incentivizing peers is part of what allowed P2P
        file-sharing to take off.
      • Lingering problem: tracker is central point of failure
      • Most BT clients today are “trackerless”, and use Distributed Hash
        Tables (DHTs) instead.
  4. VoIP: Voice over IP

  • Talking specifically about Skype, a proprietary system
  • Skype used to use a P2P network for two things: to improve
    performance, and to allow certain connections to work at all.
  • Recall the first networking lecture. Internet bred NATs: Network Address Translators.
    • Consider client A behind a NAT N, who wants to initiate a connection to server S. A’s IP is private (can’t be routed to);
      S’s and N’s are public. A -- N -- S
      • A sends a packet: [to:S from:A]
      • N rewrites the header: [to:S from:N]
        • and stores some state
      • S receives it, sends response back to N: [to:N from:S]
      • N uses stored state to figure out that this packet is really
        meant for A
        • N will keep track of the port(s) that A is communicating on.
          Communication via those ports is then meant for A.
  • Now imagine two clients, both behind NATs
    A -- N1 -- N2 -- S
    • Now A doesn’t even know S’s IP (private IPs aren’t routable). It also doesn’t know N2’s IP; it has no way to get that.
    • For Skype: means that A and S can’t call each other
    • Skype provides a directory service, so assume we can get N2’s public IP. When N2 gets packet destined for S, it has no idea what to do with it.
    • Skype will employ an additional node, a “supernode” P, with a
      public IP, and route A and S’s calls through P:

            P
           / \
      A--N1   N2--S
      • P keeps a bunch of state to get this to work, and A and S must
        both be registered users of Skype. A and S will connect to P
        as part of starting up their Skype client (so private IP users
        initiate connections to public IPs)
        • In reality, there is not one supernode, but a network of supernodes. A, S are both connected to nodes in that network, and the overlay network routes data between them.
          • Seems like this will affect performance, so Skype only let you be a supernode if your memory/CPU is sufficient (and you have a public IP)
        • Good idea?
          • A/S might not want their (encrypted) call routed through someone else
          • P might not want to pay to transit traffic for A and S
        • Today: Microsoft owns all of the supernodes, making this less of a P2P network and more of a hierarchy
          • Skype also claims that its P2P system improves quality by allowing for more optimal routing
  5. Video-streaming (briefly)
  • Can we just use BitTorrent to stream (live) video? Probably not.
    • Streaming requires getting blocks (roughly) in order
    • It also requires a certain amount of bandwidth at all times
    • BT works because peers can acquire blocks in any order
    • Moreover, most BT peers are on residential links, which have underwhelming upload bandwidth
  • What’s good for streaming? CDNs!
    • Thursday’s recitation: what CDNs bring to the table that P2P networks don’t
    • Also think about whether you want to reconsider CDNs for file-sharing
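
As promised above, a minimal Go sketch of the round-based “top four” selection (peer names and byte counts are invented; real clients also run the free-bandwidth bootstrapping described above, known as optimistic unchoking):

    package main

    import (
        "fmt"
        "sort"
    )

    // topUploaders picks the k peers that uploaded the most to us in
    // the last round; these are the peers we upload to next round.
    func topUploaders(uploaded map[string]int, k int) []string {
        peers := make([]string, 0, len(uploaded))
        for p := range uploaded {
            peers = append(peers, p)
        }
        sort.Slice(peers, func(i, j int) bool {
            return uploaded[peers[i]] > uploaded[peers[j]]
        })
        if len(peers) > k {
            peers = peers[:k]
        }
        return peers
    }

    func main() {
        round := map[string]int{ // bytes each peer uploaded to us
            "A": 900, "B": 300, "C": 1200, "D": 100, "E": 700,
        }
        fmt.Println(topUploaders(round, 4)) // [C A E B]
    }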

R13 Akamai CDNs

  1. Weaknesses of today’s Internet
  • Peering-point congestion. There is little incentive to invest in the middle mile of the network, compared with the first mile (website hosting) and the last mile (end users).
  • Inefficient routing protocols (BGP, Border Gateway Protocol). BGP computes cost by hop count, knowing nothing about the topologies, latencies, or real-time congestion of the underlying networks.
  • Unreliable networks, e.g.: cable cuts, misconfigured routers, DDoS attacks, power outages.
  • Inefficient communications protocols: TCP has high overhead.
    • Packet loss triggers TCP retransmission, which slows communication down further.
    • For interactive applications, the multiple round trips an HTTP request requires can quickly add up and hurt application performance.
    • TCP is also a serious performance bottleneck for video and other large files. Because it requires receiver acknowledgments for every window of packets sent, throughput (with standard TCP) is inversely proportional to network latency, or round-trip time (RTT). The distance between server and end user can therefore become the primary bottleneck for download speed and video quality.
  • Scalability. Provisioning for occasional peak demand is expensive and time-consuming: under-provisioning can mean lost business, while over-provisioning wastes money on unused infrastructure. And as video offerings mature, the problem keeps getting worse.
  2. Akamai services
    An application delivery network that can accelerate entire web or IP-based applications.
  • A media delivery network that provides high-quality live and on-demand media delivery, and an EdgeComputing network that deploys and executes entire Java J2EE applications in a distributed fashion.
  • The ability to maintain visibility and control over content in a distributed network. This means providing strong security, logging, SLAs, diagnostics, reporting and analytics, and management tools. Here, as with content delivery itself, there are challenges of scale, reliability, and performance to overcome.
  3. Delivery Networks as Virtual Networks

    • A delivery network is a virtual network. The advantage of the virtual-network approach is that it works over the existing Internet: it requires no client software and no changes to the underlying networks. And because it is built almost entirely in software, it can easily adapt to future needs as the Internet evolves.
    • The Akamai network is a very large distributed system, made up of tens of thousands of globally deployed servers that run sophisticated algorithms to deliver highly scalable distributed applications. It can be viewed as several delivery networks, each tailored to a different type of content (for example, static web content, streaming media, or dynamic applications). At a high level these delivery networks share a similar architecture (Figure 3 of the paper), but the underlying technology and implementation of each system component may differ to best suit the specific type of content, streaming media, or application being delivered.
  4. Workflow and system components
  • Mapping system (arrow 1): maps a URL to the IP address of an edge server that will serve the content.
  • Edge servers (arrow 2): part of the edge-server platform, located in thousands of sites around the world, processing requests from nearby users and serving the requested content.
  • Transport system: used when an edge server needs to request content from an origin server. For instance, dynamic content on a web page that is customized for each user cannot be entirely cached by the edge platform.
  • The communications and control system: disseminating status information, control messages, and configuration updates in a fault-tolerant and timely fashion.
  • The data collection and analysis system: collecting and processing data from various sources such as server logs, client logs, and network and server information. The collected data can be used for monitoring, alerting, analytics, reporting, and billing.
  • The management portal:
    • provides a configuration management platform that allows an enterprise customer to retain fine-grained control. These configurations are updated across the edge platform from the management portal via the communications and control system
    • provides the enterprise with visibility on how their users are interacting with their applications and content, including reports on audience demographics and traffic metrics.
  5. Design Principles
    Assume that failures are frequent in a CDN. This philosophy drives design decisions at every level; for example, it makes more sense to use robust commodity servers than more expensive servers with extensive hardware redundancy.
  • Reliability: ensure full redundancy of components (no single points of failure), build in multiple levels of fault tolerance, and use protocols such as PAXOS [26] and decentralized leader election to accommodate the possibility of system-component failures.
  • Scalability: with more than 60,000 machines worldwide (and growing), all platform components must be highly scalable. At the basic level, scaling means handling more traffic, content, and customers. It also means handling the ever-growing volume of resulting data that must be collected and analyzed, and building communications, control, and mapping systems that support an ever-larger number of distributed machines.
  • Limit the necessity for human management: the system must be able to respond to failures, handle shifts in load and capacity, self-tune its performance, and safely deploy software and configuration updates with minimal human intervention.
  • Performance: not only in terms of improving end-user response times, but across many different platform metrics, such as cache hit rates and network resource utilization. A side benefit of this work is energy efficiency; for example, kernel and other software optimizations allow greater capacity and more traffic with fewer machines.