
Without further ado, here is the code.

#!/usr/bin/env python3
# coding=utf8

# Preprocess the extracted data

def pretreat(infile, outfile):
  with open(infile, 'r', encoding='utf-8') as rfile, \
       open(outfile, 'w', encoding='utf-8') as wfile:
    for line in rfile:
      # Split the line into terms at '>', then split each term at '/'
      # to isolate the useful information
      terms = [term.split('/') for term in line.split('>')]
      for i, parts in enumerate(terms):
        # Handle the third element of the triple
        flag = 0
        if '@zh' in parts[0]:
          # Language-tagged literal: drop the '@zh .' suffix
          parts[0] = parts[0].replace('@zh .', '').replace('／', '')
        if '^^<http:' in parts[0]:
          # Typed literal: keep only the value in front of the datatype URI
          flag = 1
          parts[0] = parts[0].replace('^^<http:', '').replace('／', '')
          print(parts[0])
          wfile.write(parts[0].strip())
        # Index 3 is the ' .' that terminates the triple, so skip it
        if i != 3 and flag == 0:
          # Keep only the last path segment of the term
          wfile.write(parts[-1].replace('／', '').strip() + ' ')
      wfile.write('\n')

# Check whether the text contains an ASCII letter
def is_alphabet(text):
  if isinstance(text, bytes):
    text = text.decode('utf-8')
  for uchar in text:
    if 'A' <= uchar <= 'Z' or 'a' <= uchar <= 'z':
      return True
  # Only report False after every character has been checked
  return False

# Remove triples whose country name contains letters
def removealp(infile, outfile):
  with open(infile, 'r', encoding='utf-8') as rfile, \
       open(outfile, 'w', encoding='utf-8') as wfile:
    for line in rfile:
      # The country name is the first space-separated field
      if not is_alphabet(line.split(' ')[0]):
        wfile.write(line)

if __name__ == '__main__':
  pretreat('article_categories_en_uris_zh.nt', 'tag_article_categories_en_uris_zh.txt')
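
To make the transformation easier to picture, here is a minimal sketch that applies the same idea to single lines held in memory: keep the last path segment of each URI, strip the language tag or datatype from a literal, and flag names that contain ASCII letters. It is not the script above. It swaps the split-based parsing for a regular expression, and the names `short_label`, `simplify`, `has_ascii_letter` and the `example.org` triples are all made up for illustration.

```python
import re

# Hypothetical pattern: subject URI, predicate URI, object, terminating '.'
TRIPLE = re.compile(r'^\s*(<[^>]*>)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')


def short_label(term):
    """Reduce one N-Triples term to a short label (illustrative helper)."""
    term = term.strip()
    if term.startswith('<') and term.endswith('>'):
        # URI: keep only the text after the last '/'
        return term[1:-1].rsplit('/', 1)[-1]
    # Literal: drop a trailing @lang tag or ^^<datatype> suffix
    return re.sub(r'(@[A-Za-z-]+|\^\^<[^>]*>)$', '', term)


def simplify(line):
    """Return 'subject predicate object' as short labels, or None if the
    line does not look like a triple."""
    match = TRIPLE.match(line)
    if match is None:
        return None
    return ' '.join(short_label(term) for term in match.groups())


def has_ascii_letter(text):
    """True if any character of text is in A-Z or a-z."""
    return any('a' <= ch <= 'z' or 'A' <= ch <= 'Z' for ch in text)


if __name__ == '__main__':
    # Made-up sample triples, not taken from the real data file
    uri_line = ('<http://example.org/resource/中国> '
                '<http://example.org/terms/subject> '
                '<http://example.org/resource/Category:亚洲国家> .')
    literal_line = ('<http://example.org/resource/中国> '
                    '<http://example.org/terms/label> "中国"@zh .')
    print(simplify(uri_line))
    print(simplify(literal_line))
    print(has_ascii_letter('中国'), has_ascii_letter('Albedo'))
```

Unlike `pretreat`, this sketch does not leave a trailing space after the last field, and it returns `None` for lines that do not match instead of writing an empty line.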

posted on 2012-09-13 17:29 by SunRise_at, Views (1420), Comments (0), Category: 可愛的python
