From http://www.usenix.org/events/osdi10/tech/full_papers/Geambasu.pdf
There are many distributed key-value stores, but Comet is a distributed "active" key-value store. Its distinguishing features:
1. A callback mechanism is layered on top of ordinary key-value storage: when a key-value object is accessed or stored, the corresponding handler function is invoked, allowing logic to be attached to the data. The implemented callbacks are onGet, onPut, onUpdate, and onTimer.
2. Handler functions are written in Lua. Comet embeds a stripped-down Lua interpreter with many restrictions added, forming a safe sandbox in which handler code runs.
See the paper for other details. A rough C sketch of the callback idea follows.
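To make the callback idea concrete, here is a minimal C sketch. It is not Comet's API: the handler table, the one-slot store, and every name in it are invented for illustration, and in Comet the handlers are Lua functions run inside the sandboxed interpreter rather than C function pointers.

#include <stdio.h>
#include <string.h>

/* Hypothetical handler table; the slots mirror Comet's onGet/onPut hooks. */
struct handlers {
    void (*on_get)(const char *key);
    void (*on_put)(const char *key, const char *value);
};

/* A toy one-slot "store", just to show where the hooks fire. */
static char cur_key[64], cur_value[256];
static struct handlers hooks;

static void kv_put(const char *key, const char *value)
{
    if (hooks.on_put)
        hooks.on_put(key, value);   /* onPut fires when an object is stored */
    snprintf(cur_key, sizeof cur_key, "%s", key);
    snprintf(cur_value, sizeof cur_value, "%s", value);
}

static const char *kv_get(const char *key)
{
    if (hooks.on_get)
        hooks.on_get(key);          /* onGet fires on every lookup */
    return strcmp(key, cur_key) == 0 ? cur_value : NULL;
}

static void trace_get(const char *key) { printf("onGet(%s)\n", key); }
static void trace_put(const char *key, const char *value)
{
    printf("onPut(%s, %s)\n", key, value);
}

int main(void)
{
    hooks.on_get = trace_get;
    hooks.on_put = trace_put;
    kv_put("k1", "v1");
    printf("get -> %s\n", kv_get("k1"));
    return 0;
}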
Numbers Everyone Should Know

Google AppEngine Numbers
This group of numbers is from Brett Slatkin in Building Scalable Web Apps with Google App Engine.
Writes are expensive! Write cost depends on:
* The size and shape of your data
* Doing work in batches (batch puts and gets)
Reads are cheap! Data is read from disk once, then cached and served from memory:
* For a 1MB entity, that's 4000 fetches/sec
Numbers Miscellaneous
This group of numbers is from a presentation Jeff Dean gave at an Engineering All-Hands Meeting at Google.
The Lessons
The Techniques
Keep in mind these are from a Google AppEngine perspective, but the ideas are generally applicable.
Sharded Counters
We always seem to want to keep count of things. But BigTable doesn't keep a count of entities because it's a key-value store. It's very good at getting data by keys; it's not interested in how many you have. So the job of keeping counts is shifted to you.
The naive counter implementation is lock-read-increment-write. This is fine if there is a low number of writes. But if there are frequent updates there's high contention. Given that the number of writes that can be made per second is so limited, a high write load serializes and slows down the whole process.
The solution is to shard counters. This means keeping N separate counters instead of one: each write increments one of the N counters, and a read sums all N counters to get the total.
This approach seems counter-intuitive because we are used to a counter being a single incrementable variable. Reads are cheap so we replace having a single easily read counter with having to make multiple reads to recover the actual count. Frequently updated shared variables are expensive so we shard and parallelize those writes.
With a centralized database letting the database be the source of sequence numbers is doable. But to scale writes you need to partition and once you partition it becomes difficult to keep any shared state like counters. You might argue that so common a feature should be provided by GAE and I would agree 100 percent, but it's the ideas that count (pun intended).
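Here is a minimal C sketch of the same idea, using an in-process array of atomic counters in place of datastore entities (an assumption purely for illustration): each write increments one randomly chosen shard, and a read sums every shard.

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_SHARDS 16

/* One counter per shard; a write touches a single shard, so concurrent
   writers rarely contend on the same location. */
static atomic_long shards[NUM_SHARDS];

/* Write path: pick a shard at random and increment only that one. */
static void increment(void)
{
    atomic_fetch_add(&shards[rand() % NUM_SHARDS], 1);
}

/* Read path: reads are cheap, so summing all shards is acceptable. */
static long total(void)
{
    long sum = 0;
    for (int i = 0; i < NUM_SHARDS; i++)
        sum += atomic_load(&shards[i]);
    return sum;
}

int main(void)
{
    for (int i = 0; i < 1000; i++)
        increment();
    printf("count = %ld\n", total());   /* prints count = 1000 */
    return 0;
}

The shard count is the tuning knob: more shards means less write contention but more values to read and sum when the total is needed.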
Paging Through Comments
How can comments be stored such that they can be paged through in roughly the order they were entered? Under a high write load situation this is a surprisingly hard question to answer. Obviously what you want is just a counter. As a comment is made you get a sequence number and that's the order comments are displayed. But as we saw in the last section, shared state like a single counter won't scale in high write environments.
A sharded counter won't work in this situation either because summing the sharded counters isn't transactional. There's no way to guarantee each comment will get back the sequence number it allocated, so we could have duplicates.
Searches in BigTable return data in alphabetical order. So what is needed for a key is something unique and alphabetical, so that when searching through comments you can go forward and backward using only keys.
A lot of paging algorithms use counts. Give me records 1-20, 21-30, etc. SQL makes this easy, but it doesn't work for BigTable. BigTable knows how to get things by keys, so you must make keys that return data in the proper order.
In the grand old tradition of making unique keys, we just keep appending stuff until it becomes unique. The suggested key for GAE is:
time stamp + user ID + user comment ID.
Ordering by date is obvious. The good thing is that getting a time stamp is a local decision; it doesn't rely on writes and is scalable. The problem is that timestamps are not unique, especially with a lot of users.
So we can add the user name to the key to distinguish it from all other comments made at the same time. We already have the user name, so this too is a cheap call.
Theoretically even time stamps for a single user aren't sufficient. What we need then is a sequence number for each user's comments.
And this is where the GAE solution turns into something totally unexpected. Our goal is to remove write contention so we want to parallelize writes. And we have a lot of available storage, so we don't have to worry about that.
With these forces in mind, the idea is to create a counter per user. When a user adds a comment it's added to that user's comment list and a sequence number is allocated. Comments are added in a transactional context on a per-user basis using Entity Groups, so each comment add is guaranteed to be unique because updates in an Entity Group are serialized.
The resulting key is guaranteed unique and sorts properly in alphabetical order. When paging, a query is made across entity groups using the ID index. The results will be in the correct order. Paging is a matter of getting the previous and next keys in the query for the current page. These keys can then be used to move through the index.
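A rough C sketch of how such a key could be assembled (the separator, padding widths, and field order are assumptions for illustration, not GAE's actual key format): a fixed-width timestamp, then the user ID, then the per-user sequence number allocated inside that user's entity group, so plain lexicographic order matches posting order.

#include <stdio.h>
#include <time.h>

/* Build a comment key that sorts lexicographically in posting order.
   Zero-padding keeps string order identical to numeric order. */
static void make_comment_key(char *buf, size_t len,
                             long user_id, long user_seq)
{
    long long ts = (long long)time(NULL);
    snprintf(buf, len, "%012lld|%010ld|%010ld", ts, user_id, user_seq);
}

int main(void)
{
    char key[64];
    make_comment_key(key, sizeof key, 42, 7);
    printf("%s\n", key);   /* e.g. 001700000000|0000000042|0000000007 */
    return 0;
}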
I certainly would have never thought of this approach. The idea of keeping per-user comment indexes is out there. But it cleverly follows the rules of scaling in a distributed system. Writes and reads are done in parallel and that's the goal. Write contention is removed.
The prctl call on Linux has supported the PR_SET_NAME option since kernel 2.6.9; it sets the name of the calling process. Linux implements threads as lightweight processes (LWPs), so the same call can be used to set a thread's name. The API is defined as follows:
PR_SET_NAME (since Linux 2.6.9)
Set the process name for the calling process, using the value in the location pointed to by (char *) arg2. The name can be up to 16 bytes long, and should be null-terminated if it contains fewer bytes.
PR_GET_NAME (since Linux 2.6.11)
Return the process name for the calling process, in the buffer pointed to by (char *) arg2. The buffer should allow space for up to 16 bytes; the returned string will be null-terminated if it is shorter than that.
A simple implementation (set_thread_name is just an illustrative wrapper name):

#include <stdarg.h>
#include <stdio.h>
#include <sys/prctl.h>

/* Format a name and apply it to the calling thread; the 16-byte buffer
   limits it to 15 characters plus the terminating NUL. */
int set_thread_name(const char *fmt, ...)
{
    char title[16] = {0};
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(title, sizeof(title), fmt, ap);
    va_end(ap);
    return prctl(PR_SET_NAME, title);
}
Now that threads can be named, how do you see the names?
top -H
1. All keys are kept in memory, organized in a hash map for fast lookup. The in-memory entry for each key also holds a pointer to the key's data on disk (file and offset), so the data can be located directly.
2. Disk data is written append-only, taking full advantage of the disk's strength at sequential access. Every update is appended to the data file and the index is updated at the same time (a minimal sketch of points 1 and 2 follows this list).
3. Reads locate the data directly through the index; Bitcask relies on the file system cache and does not implement a caching layer of its own.
4. Because updates are written to new locations, stale data at the old locations is periodically cleaned up and merged, reclaiming disk space.
5. Concurrency control for reads and writes uses vector clocks.
6. The in-memory index is also flushed to separate index files, so a restart does not have to rebuild the entire index.
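Here is a minimal C sketch of points 1 and 2 (the record layout, the small linear index standing in for the hash map, and the file name are simplifications, not Bitcask's real format): a put appends the value to the data file and records its offset and length in an in-memory index; a get seeks straight to that offset.

#include <stdio.h>
#include <string.h>

/* Tiny stand-in for the in-memory hash index: key -> (offset, length)
   into the append-only data file. */
struct index_entry { char key[32]; long offset; long length; };
static struct index_entry index_tab[128];
static int index_len;

/* put: append the value to the data file and remember where it landed. */
static int bc_put(FILE *data, const char *key, const char *val)
{
    struct index_entry *e = &index_tab[index_len++];
    fseek(data, 0, SEEK_END);                 /* append-only, sequential */
    snprintf(e->key, sizeof e->key, "%s", key);
    e->offset = ftell(data);
    e->length = (long)strlen(val);
    fwrite(val, 1, (size_t)e->length, data);
    return fflush(data);
}

/* get: one index lookup, then one seek + read at the recorded offset. */
static long bc_get(FILE *data, const char *key, char *buf, long buflen)
{
    for (int i = index_len - 1; i >= 0; i--) {   /* newest entry wins */
        if (strcmp(index_tab[i].key, key) == 0) {
            long n = index_tab[i].length < buflen ? index_tab[i].length : buflen;
            fseek(data, index_tab[i].offset, SEEK_SET);
            return (long)fread(buf, 1, (size_t)n, data);
        }
    }
    return -1;
}

int main(void)
{
    FILE *data = fopen("data.log", "w+b");
    char buf[64] = {0};
    if (!data)
        return 1;
    bc_put(data, "k1", "hello");
    bc_put(data, "k1", "hello again");   /* the old record is left in place */
    if (bc_get(data, "k1", buf, sizeof buf - 1) > 0)
        printf("%s\n", buf);             /* prints "hello again" */
    fclose(data);
    return 0;
}

The second put leaves the first record behind as garbage; reclaiming that space is exactly the periodic merge described in point 4.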
http://highscalability.com/blog/2011/1/10/riaks-bitcask-a-log-structured-hash-table-for-fast-keyvalue.html
1. Modern operating systems have sophisticated optimizations for memory management and disk I/O to improve overall performance, and user-space programs need to be aware of and cooperate with these mechanisms. Take Squid as an example: it implements its own object caching and eviction policy, which mirrors what the OS already does, keeping recently accessed objects in memory and flushing cold objects to disk to free memory. In some situations this duplicated mechanism conflicts with the OS and fails to achieve the expected result. When a memory object cached by Squid has not been accessed for a while and has not yet been flushed to disk by Squid, the OS may swap that cold object out because of memory pressure; Squid does not know this and still believes the object is in memory. When Squid's eviction policy later decides to flush that cold object to disk, the OS first has to page it back in from swap, and only then can Squid write it out. The performance cost of this whole round trip is plain to see.
Note: this example cuts both ways. If an application's memory objects are being swapped out, the system is already short of memory and the effectiveness of any in-memory cache is greatly reduced.
2. A persistent cache needs to reconstruct its cache from the persisted data, and there are generally two ways to do it. One is to read from disk on demand: since accesses are random and random disk reads are slow, access performance is poor but memory is saved, which suits low-traffic small machines with a large cached data set. The other is to build a complete index from disk up front, which greatly improves access performance.
Unlike a database on disk, a persistent cache has modest reliability requirements and does not need strict crash recovery. Varnish takes the second approach and improves reliability with layered protection: the top layer is protected by A/B writes, while the lower layer's actual data carries no reliability guarantee.
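As a rough illustration of the "let the kernel manage residency" idea (a minimal sketch only, not Varnish's actual storage code; the file name and size are arbitrary): map a backing file into memory once and access objects through the mapping, so the page cache decides what stays in RAM and when dirty pages go back to disk, and the application never holds a second copy of the data.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define STORE_SIZE (16 * 1024 * 1024)   /* 16 MB backing file */

int main(void)
{
    int fd = open("cache.store", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, STORE_SIZE) != 0)
        return 1;

    /* The mapping is backed by the file; the kernel pages data in and out
       under memory pressure, so the process keeps no separate cache. */
    char *store = mmap(NULL, STORE_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (store == MAP_FAILED)
        return 1;

    memcpy(store, "cached object #1", 17);   /* write through the mapping */
    printf("%s\n", store);                   /* read back the same pages  */

    munmap(store, STORE_SIZE);
    close(fd);
    return 0;
}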
http://www.varnish-cache.org/trac/wiki/ArchitectNotes
An introduction to Kafka, a messaging middleware
Purpose and use cases
Kafka is LinkedIn's distributed messaging system. Its design emphasizes high throughput, and it is used for friend-activity feeds, relevance statistics, ranking statistics, access-rate limiting, batch processing, and similar systems.
The traditional offline-analysis approach records data in log files and then batch-processes them centrally. That works poorly for activity-stream data with strong real-time requirements, while most message middleware handles real-time messages/data well but is weak at persisting a large backlog of unprocessed messages/data in the queue.
Design principles
Persistent messages
High throughput
The consumer tracks message state
Every role in the system runs as a distributed cluster
Consumers have the notion of logical groups: each consumer process belongs to one consumer group, and each message is delivered to exactly one consumer process in every consumer group that subscribes to it.
LinkedIn runs multiple consumer groups, each made up of several consumer processes with the same responsibility.
Deployment architecture
Message persistence and caching
Kafka uses disk files for persistence. Disk throughput depends on how the disk is used: random writes are much slower than sequential writes, and a modern OS uses whatever memory it can cheaply reclaim as a cache to merge disk writes. A second cache in the user process therefore adds little. Kafka's reads and writes are all sequential: messages are appended to the log files.
To reduce memory copies, Kafka sends data with sendfile, and it improves performance further by batching messages together.
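A minimal Linux sketch of that zero-copy path (not Kafka's own code, which runs on the JVM and uses FileChannel.transferTo; the connected socket and the log-segment path are assumed to exist): sendfile(2) moves bytes from the file to the socket entirely inside the kernel, avoiding copies through user-space buffers.

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Send `count` bytes of a log segment, starting at `offset`, straight to a
   connected socket. The data never passes through user-space buffers. */
ssize_t send_chunk(int sock_fd, const char *path, off_t offset, size_t count)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t sent = sendfile(sock_fd, fd, &offset, count);
    close(fd);
    return sent;
}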
Kafka does not store per-message state on the broker; each client's position is tracked as (consumer, topic, partition), which removes the burden of maintaining state for every individual message.
On the choice between pushing and pulling messages, Kafka pulls, because pushing introduces uncertainty when clients differ in processing capacity and traffic.
Load balancing
Producers reach brokers through hardware load balancing; brokers and consumers both run as clusters and coordinate changes and membership through ZooKeeper.
/proc/{pid}/ holds all of the runtime data for a running process and can be used to analyze the process's resource consumption and behavior.
1. /proc/{pid}/stat
Process runtime statistics (see the sketch after this list):
awk '{print $1,$2,$3,$14,$15,$20,$22,$23,$24}' stat
PID, COMM, STATE, UTIME (CPU ticks in user mode), STIME (CPU ticks in kernel mode), THREADS, START_TIME, VSIZE (virtual memory size in bytes), RSS (resident set size, in pages)
2. /proc/{pid}/status
Most of the same data as stat, in a more readable form.
3. /proc/{pid}/task/
Runtime information for each thread.
4. /proc/{pid}/fd/
File descriptors opened by the process.
5. /proc/{pid}/io
I/O statistics for the process.
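A small C sketch that pulls the fields listed under item 1 out of /proc/self/stat. Because comm (field 2) may itself contain spaces, the parse restarts after the last ')':

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[1024], comm[64] = "", state = '?';
    unsigned long utime = 0, stime = 0, vsize = 0;
    unsigned long long starttime = 0;
    long threads = 0, rss = 0;

    FILE *f = fopen("/proc/self/stat", "r");
    if (!f || !fgets(line, sizeof line, f))
        return 1;
    fclose(f);

    sscanf(line, "%*d (%63[^)]", comm);      /* field 2: comm */
    char *p = strrchr(line, ')');            /* restart after comm */
    if (!p)
        return 1;
    /* field 3 is state; then skip ahead to utime(14), stime(15),
       threads(20), starttime(22), vsize(23), rss(24) */
    sscanf(p + 2,
           "%c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu "
           "%*d %*d %*d %*d %ld %*d %llu %lu %ld",
           &state, &utime, &stime, &threads, &starttime, &vsize, &rss);

    printf("comm=%s state=%c utime=%lu stime=%lu threads=%ld "
           "start=%llu vsize=%lu rss=%ld\n",
           comm, state, utime, stime, threads, starttime, vsize, rss);
    return 0;
}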
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.route.max_size = 4096000
net.core.somaxconn = 8192
net.ipv4.tcp_synack_retries = 1
net.ipv4.tcp_syn_retries = 1
net.ipv4.netfilter.ip_conntrack_max = 2621400
net.core.rmem_max = 20000000
ulimit -n 40960
ulimit -c unlimited
Just a bookmark for now; to be filled in later.
Redis copies memory through the copy-on-write semantics of the fork system call, which keeps the data consistent while it is being dumped.
But if the data changes heavily while the dump is running, a large amount of copy-on-write copying takes place and the memory-copy load on the system rises.
The logic (a minimal sketch follows this list):
1. The main process calls fork.
2. The child process closes the listening fd and starts dumping the data to storage.
3. The main process adjusts its policy to reduce changes to in-memory data.
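A minimal C sketch of the fork-and-dump pattern (the in-memory data, the dump file name, and the output format are placeholders, not Redis's RDB format): the child sees a copy-on-write snapshot of the parent's memory as it was at fork time, so it can write a consistent image while the parent keeps accepting writes.

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

static char dataset[64] = "k1=v1;k2=v2";   /* stand-in for the in-memory data */

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* Child: its view of `dataset` is the snapshot taken at fork time.
           Pages are only physically copied if the parent writes to them. */
        FILE *f = fopen("dump.snapshot", "w");
        if (f) {
            fprintf(f, "%s\n", dataset);
            fclose(f);
        }
        _exit(0);
    }

    /* Parent: keeps mutating the live data; the change below triggers
       copy-on-write but never appears in the child's snapshot. */
    strcat(dataset, ";k3=v3");
    waitpid(pid, NULL, 0);
    printf("live: %s\n", dataset);   /* dump.snapshot still holds the old view */
    return 0;
}

Every page the parent touches after the fork has to be copied by the kernel, which is exactly the extra copy-on-write load described above when the data set changes heavily during a dump.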
This strategy alone does not guarantee durability: there is no write-ahead log, so data can be lost in a crash. Redis therefore added an append-only log file (AOF) to keep data safe, but writing the log on every update makes the file grow quickly, so Redis compacts the log in the background using an approach similar to the snapshot dump.
Note: databases generally rely on a write-ahead log for durability, but even that log is not flushed in real time; it is written to a buffer and flushed to the file when triggered.