L2 Naming
- Modularity limit complexity. 很好的模式化的做法是 client/server design.
- DNS maps hostnames to IP addresses. It is also a good example of hierarchy.
Example: DNS serve
DNS hierarchy
root-com/edu/net/org/gov
com-apple/google
google-mail/wwwDNS look-up algorithms
L3 memory virtualization
- virtualization: modularity on a single machine
- programs shouldn’t be able to refer to (and corrupt) each others’ memory
virtualize memory - programs should be able to communicate
virtualize communication links - programs should be able to share a CPU w/o one program halting the progress of the others
virtualize processors
- Memory virtualization
每个进程有4GB的虚拟地址
- How to store the mapping: MMU
如果每个虚拟地址(2^32 entries), 都用32bits的地址来存储mapping的话,那么用16GB to store the table.解决方案是 MMU
Page Tables: address translate, virtual add to real add.
大大缩短表长度的原因是,用offset来直接mapping.
那么现在table有2^(32-12) = 2^20 entries, 每个entry用32bits地址来表示=4MB to store the table.
于是每个entry用31-13(20位)来存储physical page number, 剩下12位, 1位是present (P) bit: is the page currently in DRAM? 1位是read/write (R/W) bit: is the program allowed to write to this address? 1位是user/supervisor (U/S) bit: does the program have access to this address?
接下来思考,如何进一步缩短page table呢: page the page table.
Kernel manages page faults and other interrupts. 用来load page或置换.
- abstraction of devices
via System call, which are implemented with interrupts. Using interrupts means the kernel directly accesses the devices, not the user.
L4 Bounded Buffers and locks
programs communicate的抽象: bounded buffers.
- Bounded buffer: a buffer that stores (up to) N messages
实现原理是循环队列, 这也是go channel的底层实现原理。
bounded buffer API:1
2send(m)
m <- receive(
lock API:1
2acquire(l)
release(l)
1 | send(bb, message): while True: |
1 | receive(bb): while True: |
locks: allow only one CPU to be inside a piece of code at a time
Avoid race in bounded buffer
1
2
3
4
5
6
7
8
9send(bb, message):
acquire(bb.lock)
while bb.in - bb.out == N:
release(bb.lock)
acquire(bb.lock)
bb.buf[bb.in mod N] <- message
bb.in <- bb.in + 1
release(bb.lock)
returnImplementing locks
1 | acquire(lock): |
1 | release(lock): |
problem: race condition (need locks to implement locks!)
spin lock的真正的原理:1
2
3
4
5acquire(lock):
do:
r = 1
XCHG r, lock //atom
while r == 1
交换指令XCHG是两个寄存器,寄存器和内存变量之间内容的交换指令
1 | release(lock): |
L5 Threads
- Thread
thread: a virtual processor
API:
- suspend(): save state of current thread to memory
- resume(): restore state from memory
- Condition Variables:类似与信号量
let threads wait for events, and get notified when they occur.
API:
- wait(cv): yield processor and wait to be notified of cv
- notify(cv): notify waiting threads of cv
eg: applied in bounded buffers1
2
3
4
5
6
7
8
9
10
11send(bb, message):
acquire(bb.lock)
while bb.in - bb.out == N:
release(bb.lock)
wait(bb.not_full)//hang
acquire(bb.lock)
bb.buf[bb.in mod N] <- message
bb.in <- bb.in + 1
release(bb.lock)
notify(bb.not_empty)//notify
return
(threads in receive() will wait on bb.not_empty and notify of bb.not_full)
问题: lost notify
改进API:
wait(cv,lock): yield processor, release lock, wait to be notified of cv
notify(cv): notify waiting threads of cv1
2
3
4
5
6
7
8
9send(bb, message):
acquire(bb.lock)
while bb.in - bb.out == N:
wait(bb.not_full, bb.lock)//release lock, wait to be notified
bb.buf[bb.in mod N] <- message
bb.in <- bb.in + 1
release(bb.lock)
notify(bb.not_empty)
return
1 | wait(cv, lock): |
1 | notify(cv): |
L6 OS and virtual machines
vm结构
VMM runs in kernel-mode on hardware, to let VMs safely share access to physical hardware.
VMM’s goal: virtualize hardware
run VMM on hardware in kernel mode with guest OSes in user mode.
- Role for VMM
- Allocate resources
- Dispatch events
- Deal with instructions from guest OS that require interaction with the physical hardware
- Attempt 1: emulate every single instruction
- Problen: Slow
- Attempt 2: guest OSes run instructions directly on CPU
- Problem: dealing with privileged instructions (can’t run in kernel mode; then we’d be back to our original problem)
- VMM will deal with handling privileged instructions
- VMM Implementation
- Trap and emulate
- Guest OS in user mode
- Privileged instructions cause an exception; VMM intercepts these and emulates
- If VMM can’t emulate, send exception back up to guest OS
- Problems:
- How to do the emulate
- How to deal with instructions that don’t trigger an interrupt but that the VMM still needs to intercept
- How to do the emulate
- Guest OS in user mode
- Virtualizing memory
- VMM needs to translate guest OS addresses into physical memory addresses. Three layers: guest virtual, guest physical, host physical
- Approach 1: Shadow pages
- Guest OS loads PTR; causes interrupt. VMM intercepts
- VMM locates guest OS’s page table. Combines guest OS’s table with its own table, constructing a third table mapping guest virtual to host physical
- VMM loads host physical addr of this new page table into the hardware PTR(pointer)
- If guest OS modifies its page table, no interrupt thrown. To force an interrupt, VMM marks guest OS’s page table as read-only memory
- Guest OS loads PTR; causes interrupt. VMM intercepts
- Approach 2
- Modern hardware has support for virtualization
- Physical hardware (effectively) knows about both levels of tables: will do lookup in the guest OS’s page table and then the VMM’s page table
- Approach 1: Shadow pages
- Virtualizing U/K bit(page table上的有效位)
- Problem with basic trap-and-emulate: U/K bit involved in some instructions that don’t cause exception (e.g., reading U/K bit, writing it to U)
- Few solutions:
- Para-virtualization: modify guest OS. Hard to do, and goes against our compatibility goal
- Binary translation: VMM analyzes code from guest OS and replaces problematic instructions
- Hardware support: some architectures have virtualization support built in. Have special VMM operating mode in addition to the U/K bit
- Hardware support is arguably the best. Makes VMM’s job easier.
- Few solutions:
- Monolithic kernels
- VMs protect OSes from each other’s faults, protect physical machine from OS faults. Why so many bugs, though?
- The Linux kernel is, effectively, one large C program. Careful software engineering, but very little modularity within the kernel itself.
- Bugs come about because of its complexity
- Kernel bugs = entire system failure (recall the in-class demo)
- Even worse: adversary can exploit these bugs
- Microkernels: alternative to monolithic kernels
- Put subsystems – file servers, device drivers, etc. – in user programs. More modular.
- There will still be bugs but:
- Fewer, because of decreased complexity
- A single bug is less likely to crash the entire system
- Why isn’t Linux a microkernel, then?
- High communication cost between modules
- Not clear that moving programs to user space is worth it
- Hard to balance dependencies (e.g., sharing memory across
modules) - Redesign is tough!
- Spend a year of developer time rewriting the kernel or adding
new features? - Microkernels can make it more difficult to change interfaces
- Some parts of Linux do have microkernel design aspects
L7 Performance
两个衡量指标:latency, throughput
- Throughput: number of requests over a unit of time
- Latency: amount of time for a single request
L8 Networking
Network layer
1
2
3
4
5
67 [ Application ] - super high level
5/6 [ Session/Presentation ]
4 [ Transport ] - reliable (maybe) delivery(TCP/UDP)
3 [ Network - IP ] - addressing/routing
2 [ Link ] - point-to-point links
1 [ Physical ] - physical mediumProblems we deal with today
- DDoS: send a lot of traffic to one machine to consume its
resources. Hard to prevent. Internet wasn’t designed with
accountability in mind. - Security: Internet was not designed for security. DNS, BGP, etc.,
have no secure infrastructure - Mobility: No one every imagined this
- Address Space Depletion: IPv4 -> IPv6
- Congestion control: should probably change given the changes we’ve
seen in the Internet (more about this later)
L11 Network: Reliable Transport and Congestion Control
- TCP
Reliable Transport via sliding-window protocol
- Goal: receiving application gets a complete, in-order bytestream from the sender. One copy of every packet, in order.
- Why do we need it? Network is unreliable. Packets get dropped, can arrive out-of-order.
- Basics:
- Every data packet gets a sequence number (1, 2, 3, …)
- Sender has W outstanding packets at any given time. W = window size
- When receive gets a packet, it sends an ACK back. ACKs are cumulative: An ACK for X indicates “I have received all packets up to and including X.”
- If sender doesn’t receive an ACK indicating that packet X has been received, after some amount of time it will “timeout” and retransmit X.
- Maybe X was lost, its ACK was lost, or its ACK is delayed
- The timeout = proportional to (but a bit larger than) the RTT of the path between sender and receiver
- At receiver: keep buffer to avoid delivering out-of-order packets, keep track of last-packet-delivered to avoid delivering duplicates.
- Congestion Control
Goals: Efficiency and fairness
- Minimize drops, minimize delay, maximize utilization
- Share bandwidth fairly among all connections that are using it
- FOR NOW: assume all senders have infinite offered load. Fairness = splitting bandwidth equally amongst them.
- But no senders knows how many other senders there are, and that number can change over time.
- We’ll use window-based congestion control. Switches are dumb (can only drop packets); senders are smart
Mechanisms
- Slow Start
- At the beginning of the connection, exponential increase the window (double it every RTT until you see loss)
- Decreases the time it takes for the initial window to “ramp up”
- Fast Retransmit/Fast Recovery
- When a sender receives an ACK with sequence number X, and then three duplicates of that packet, it immediately retransmits packet X+1 (remember: ACKs are cumulative)
Ex:Send 123456
Receive12 222
Sender receives 4 ACKs total with sequence number “2”; infers that packet 3 is lost, immediately retransmits- On fast-retransmit, window decrease is as before: W = W/2
- In fact, when a packet is lost due to timeout, TCP behaves differently: W = 1, then do slow-start until the last good window, and then start additive increase.
- (See slide for diagram)
- Reasoning: if there is a retransmission due to timeout, then there is significant loss in the network, and senders should back way off.
- When a sender receives an ACK with sequence number X, and then three duplicates of that packet, it immediately retransmits packet X+1 (remember: ACKs are cumulative)
- Slow Start
L13 Networking: P2P Networks + CDNs
file-sharing, VoIP, and video-streaming的解决方案:
P2P networks, or related constructs (CDNs).
File-sharing: getting a file from one person (machine) to another
- Can use client/server
- client requests file, server responds with the data
- HTTP, FTP work this way
- Downsides: single point of failure, expensive, doesn’t scale
- Could use CDNs
- Buy multiple servers, put them near clients to decrease latency
- No single point of failure, scales better
- See the next recitation for more discussion
- Can use client/server
Peer-to-peer (P2P) networks for file-sharing
- Distribute the architecture to the extreme
- Once a client downloads (part of) the file from the server, that
client can upload (part of) the file to others. Put clients to
work! - In theory: infinitely scalable
- P2P networks create overlays on top of the underlying Internet (so do CDNs)
- Problem: what if users aren’t willing to upload?
BitTorrent: how to incentivize peers to upload
- Basics of original BitTorrent (BT) protocol:
- Create a .torrent file, which contains meta-information about
the file (file name, length, info about pieces that comprise the
file, URL of tracker) - Have a tracker. A server that knows the identity of all the
peers involved in your file transfer. - To download:
- Peer contacts tracker
- Tracker responds with list of other peers involved in transfer
- Peer connects to these other peers, begins to transfer blocks
(see below) - Some peers are seeders: already have the entire file (maybe
servers that host the file, or just nice peers who are
sticking around)
- Create a .torrent file, which contains meta-information about
- In the actual download, peers request blocks: pieces of pieces
- Details/terminology doesn’t matter. Just know that blocks are small (~16KB) chunks of the file.
- Request blocks in a random order (more or less)
- What incentivizes users to upload (UL) rather than just
download(DL)ing?- High-level: users aren’t allowed to DL from a user unless
they’re also ULing to that user- So peers want mutual interest: A has to have blocks that B
needs, and vice versa.
- So peers want mutual interest: A has to have blocks that B
- Protocol is divided into rounds. In round n, some number of
peers upload blocks to Peer X. In round n+1, Peer X will send
blocks to the peers that uploaded the most in round n.
(Typically, to the top four peers.) - How do peers get started? Each peer reserves some (small)
amount of bandwidth to give away freely
- High-level: users aren’t allowed to DL from a user unless
- This method of incentivizing peers is part of what allowed P2P
file-sharing to take off. - Lingering problem: tracker is central point of failure
- Most BT clients today are “trackerless”, and use Distributed Hash
Tables (DHTs) instead.
- Basics of original BitTorrent (BT) protocol:
VoIP: Voice over IP
- Talking specifically about Skype, a proprietary system
- Skype used to use a P2P network for two things: to improve
performance, allow certain connections to work at all. - Recall the first networking lecture. Internet bred NATs: Network Address Translators.
- Consider client A behind a NAT, who wants to initiate a connection to server S. A’s IP is private (can’t route to it);
S’s and N’s are public. A — N —- S- A sends a packet: [to:S from:A]
- N rewrites the header: [to:S from:N]
- and stores some state
- S receives it, sends response back to N: [to:N from:S]
- N uses stored state to figure out that this packet is really
meant for A- N will keep track of the port(s) that A is communicating on.
Communication via those ports is then meant for A.- N will keep track of the port(s) that A is communicating on. Communication via those ports is then meant for A.
- N will keep track of the port(s) that A is communicating on.
- Consider client A behind a NAT, who wants to initiate a connection to server S. A’s IP is private (can’t route to it);
- Now imagine two clients, both behind NATs
A — N1 —- N2 — S- Now A doesn’t even know S’s IP (private IPs aren’t routable). It also doesn’t know N2’s IP; it has no way to get that.
- For Skype: means that A and S can’t call each other
- Skype provides a directory service, so assume we can get N2’s public IP. When N2 gets packet destined for S, it has no idea what to do with it.
- Skype will employ an additional node – a “supernode” – P, with a
public IP, and route A and S’s calls through P:
/ \P
A–N1 N2–S- P keeps a bunch of state to get this to work, and A and S must
both be registered users of Skype. A and S will connect to P
as part of starting up their Skype client (so private IP users
initiate connections to public IPs)- In reality, there is not one supernode, but a network of supernodes. A, S are both connected to nodes in that network, and the overlay network routes data between them.
- Seems like this will affect performance, so Skype only let you be a supernode if your memory/CPU is sufficient (and you have a public IP)
- Good idea?
- A/S might not want their (encrypted) call routed through someone else
- P might not want to pay to transit traffic for A and S
- Today: Microsoft owns all of the supernodes, making this less of a P2P network and more of a hierarchy
- Skype also claims that its P2P system improves quality by allowing for more optimal routing
- In reality, there is not one supernode, but a network of supernodes. A, S are both connected to nodes in that network, and the overlay network routes data between them.
- P keeps a bunch of state to get this to work, and A and S must
- Video-streaming (briefly)
- Can we just use BitTorrent to stream (live) video?
- streaming requires getting blocks (roughly) in order
- also requires certain amount of bandwidth at all times
- Probably not
- BT works because peers can acquire blocks in any order
- Moreover, most BT peers are on residential links, which have underwhelming upload bandwidth.
- What’s good for streaming? CDNs!
- Thursday’s recitation: what CDNs bring to the table that P2P networks don’t
- Also think about whether you want to reconsider CDNs for file-sharing
R13 Akamai CDNs
- 现有网络的弱点
- Peering point congestion. 网络中间设备投入较少,比起第一英里(即网站托管)和最后一英里(即最终用户),因为缺乏动机。
- Inefficient routing protocols(BGP, border gateway protocol), BGP的cost是用hop来计算的, knowing nothing about topologies, latencies, or real-time congestion of the underlaying networks.
- Unreliable networks, eg: cable cuts, misconfigured routers, DDoS attacks, power outages.
- Inefficient communications protocols, TCP开销大。
- 因为数据包丢失会触发TCP重传,从而进一步降低通信速度。
- 交互式应用程序,HTTP请求所需的多次往返可以快速congestion,从而影响应用程序性能
- TCP也成为视频和其他大型文件的严重性能瓶颈。因为它需要对发送的每个数据包窗口进行接收器确认,所以吞吐量(使用标准TCP时)与网络延迟或往返时间(RTT)成反比。因此,服务器和最终用户之间的距离可能成为下载速度和视频观看质量的首要瓶颈。
- Scalability, 为了保证偶尔的高峰需求是是昂贵且耗时的。供应不足意味着可能会失去业务,而过度配置则意味着在未使用的基础设施上浪费资金。并且随着视频业务的成熟,问题也越来越严重。
- Akamai service
拥有可加速整个Web或基于IP的应用程序的application delivery network.
- 提供高质量的实时和点播媒体交付的媒体交付网络,以及以分布式方式部署和执行整个Java J2EE应用程序的EdgeComputing网络。
- 分布式网络中维护可视性和控制内容的能力。这意味着提供强大的安全性,日志记录,SLA,诊断,报告和分析以及管理工具。这里,与内容传递本身一样,存在要克服的规模,可靠性和性能的挑战。
Delivery Networks as Virtual Networks
- Delivery Networks是一种虚拟网络,虚拟网络方法的优点在于它可以在现有的Internet上工作,不需要客户端软件,也不需要更改底层网络。而且,由于它几乎完全由软件构建,因此随着互联网的发展,它可以很容易地适应未来的需求。
- Akamai网络是一个非常大的分布式系统,由数以万计的全球部署服务器组成,这些服务器运行复杂的算法,以实现高度可扩展的分布式应用程序的交付。 我们可以将其视为由多个传送网络组成,每个传送网络都针对不同类型的内容(例如,静态Web内容,流媒体或动态应用程序)进行定制。 从较高层面来看,这些交付网络共享一个类似的架构,如图3所示,但每个系统组件的基础技术和实现可能不同,以便最好地适应特定类型的内容,流媒体或正在交付的应用程序。
- 工作流程
- Mapping system(arrow 1): URL -> map to the ip of an Edge Server to server the content.
- Edge Server(arrow 2): part of the edge server platform. Located in thousands of sites around the world. Processing requests from nearby users and serving the requested content
- Transport System: Edge server need to request content from an origin server. For instance, dynamic content on a web page that is customized for each user cannot be entirely cached by the edge platform.
- The communications and control system: disseminating status information, control messages, and configuration updates in a fault-tolerant and timely fashion.
- The data collection and analysis system: collecting and processing data from various sources such as server logs, client logs, and network and server information. The collected data can be used for monitoring, alerting, analytics, reporting, and billing.
- The management portal serves:
- provides a configuration management platform that allows an enterprise customer to retain fine-grained control. These configurations are updated across the edge platform from the management portal via the communications and control system
- provides the enterprise with visibility on how their users are interacting with their applications and content, including reports on audience demographics and traffic metrics.
- Design Principles
假设故障在CDN中是频繁发生的。这个理念决定每个级别的设计决策。比如用强大的商用服务器比具有大量硬件冗余的更昂贵的服务器更有意义。
- Reliability: 确保组件的完全冗余(没有单点故障),构建多级容错,并使用PAXOS [26]和分散式领导者选举等协议来适应系统组件故障的可能性。
- Scalability: 全球有超过60,000台机器(并且不断增长),所有平台组件必须具有高度可扩展性。 在基本级别,扩展意味着处理更多流量,内容和客户。 这也转化为处理越来越多的必须收集和分析的结果数据,以及构建必须支持越来越多的分布式机器的通信,控制和映射系统。
- Limit the necessity for human management: 因此,系统必须能够响应故障,处理负载和容量的变化,自我调整性能,并以最少的人为干预安全地部署软件和配置更新。
- Performance: 不仅从改善最终用户响应时间的角度,而且从整个平台的许多不同指标,如缓存命中率和网络资源利用率。 这项工作的一个额外好处是能源效率; 例如,内核和其他软件优化可以使用更少的计算机实现更大的容量和更多的流量。