Showing posts with label Python. Show all posts

Saturday, December 15, 2012

Automatic character-set detection in Python with chardet


chardet can be downloaded from http://chardet.feedparser.org/
It can also be installed with pip: pip install chardet

Test:


In [1]: import chardet

In [2]: a = "我"

In [3]: chardet.detect(a)
Out[3]: {'confidence': 0.505, 'encoding': 'utf-8'}

Pragmatic Unicode (reposted)


This is a presentation I gave at PyCon 2012. You can read the slides and text on this page, or open the actual presentation in your browser, or watch the video:
Also, clicking the slide images will jump into the full presentation at that point. The Symbola font is included, but will have to be downloaded before some of the special symbols will appear.
Pragmatic Unicode, or, How Do I Stop the Pain?
Hi, I'm Ned Batchelder. I've been writing in Python for over ten years, which means at least a half-dozen times, I've made the same Unicode mistakes that everyone else has.
The past
If you're like most Python programmers, you've done it too: you've built a nice application, and everything seemed to be going fine. Then one day an accented character appeared out of nowhere, and your program started belching UnicodeErrors.
You kind of knew what to do with those, so you added an encode or a decode where the error was raised, but the UnicodeError happened somewhere else. You went to the new place, and added a decode, maybe an encode. After playing whack-a-mole like this for a while, the problem seemed to be fixed.
Then a few days later, another accent appeared in another place, and you had to play a little bit more whack-a-mole until the problem finally stopped.
You
So now you have a program that works, but you're annoyed and uncomfortable, it took too long, you know it isn't "right," and you hate yourself. And the main thing you know about Unicode is that you don't like Unicode.
You don't want to know about weirdo character sets, you just want to be able to write a program that doesn't make you feel bad.
This talk
You don't have to play whack-a-mole. Unicode isn't simple, but it isn't difficult either. With knowledge and discipline, you can deal with Unicode easily and with grace.
I'll teach you five Facts of Life, and give you three pro tips that will solve your Unicode problems. We're going to cover the basics of Unicode, and how both Python 2 and Python 3 work. They are different, but the strategies you'll use are basically the same.

The World & Unicode

We'll start with the basics of Unicode.
Bytes
The first Fact of Life: everything in a computer is bytes. Files on disk are a series of bytes, and network connections transmit only bytes. Almost without exception, all the data going into or out of any program you write, is bytes.
Bytes by themselves are meaningless; we need conventions to give them meaning.
ASCII
To represent text, we've been using the ASCII code for nearly 50 years. Every byte is assigned one of 95 symbols. When I send you a byte 65, you know that I mean an upper-case A.
ISO 8859-1
ISO Latin 1, or 8859-1, extended ASCII with 96 more symbols. This is pretty much the best you can do to represent text as single bytes, because there's not much room left to add more symbols.
Windows-1252
Windows added 27 more symbols to produce CP1252.
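As a small illustration of these single-byte codes, here is a sketch in Python 3 syntax using the standard codec names `ascii`, `latin-1`, and `cp1252`. The byte values chosen are just examples; the point is that the same byte can mean different things under different conventions.

```python
# One byte, several conventions (Python 3 syntax).
byte_a = bytes([65])
as_ascii = byte_a.decode("ascii")        # byte 65 is an upper-case A

byte_e9 = bytes([0xE9])
as_latin1 = byte_e9.decode("latin-1")    # e with acute accent in ISO 8859-1
as_cp1252 = byte_e9.decode("cp1252")     # CP1252 agrees with Latin 1 here

byte_80 = bytes([0x80])
euro = byte_80.decode("cp1252")          # CP1252 uses this byte for the euro sign
control = byte_80.decode("latin-1")      # Latin 1 treats it as a control character

print(as_ascii, as_latin1, as_cp1252, euro, repr(control))
```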
Tower of Babel
But Fact of Life #2 is that there are way more symbols in the world's text than 256. A single byte simply can't represent text world-wide. During your darkest whack-a-mole moments, you may have wished that everyone spoke English, but it simply isn't so. People need lots of symbols to communicate.
Fact of Life #1 and Fact of Life #2 together create a fundamental conflict between the structure of our computing devices, and the needs of the world's people.
Character codes
There have been a number of attempts to resolve this conflict. Single-byte character codes like ASCII map bytes to symbols, or characters. Each one pretends that Fact of Life #2 doesn't exist.
There are many single-byte codes, and they don't solve the problem. Each is only good for representing one small slice of human language. They can't solve the global text problem.
Character codes
People tried creating double-byte character sets, but they were still fragmented, serving different subsets of people. There were multiple standards in place, and ironically, they weren't large enough to deal with all the symbols needed.
Unicode
Unicode was designed to deal decisively with the issues with older character codes. Unicode assigns integers, known as code points, to characters. It has room for 1.1 million code points, of which 110,000 are already assigned, so there's plenty of room for future growth.
Unicode's goal is to have everything. It starts with ASCII, and includes thousands of symbols, including the famous Snowman, covers all the writing systems of the world, and is constantly being expanded. For example, the latest update gave us the symbol PILE OF POO.
Sample Unicode
Here is a string of six exotic Unicode characters. Unicode code points are written as 4, 5, or 6 hex digits with a U+ prefix. Every character has an unambiguous full name which is always in uppercase ASCII.
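To make the "U+ prefix plus full name" convention concrete, the following sketch (Python 3 syntax) looks up a few characters with the standard library's `unicodedata` module. The three sample characters and the `describe` helper are illustrative choices, not the ones from the slide.

```python
import unicodedata

# Illustrative characters: snowman, umbrella, and pile of poo.
samples = ["\u2603", "\u2602", "\U0001F4A9"]

def describe(ch):
    """Return the 'U+XXXX NAME' form for a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

for ch in samples:
    print(describe(ch))
```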
Encodings
So Unicode makes room for all of the characters we could ever need, but we still have Fact of Life #1 to deal with: computers need bytes. We need a way to represent Unicode code points as bytes in order to store or transmit them.
The Unicode standard defines a number of ways to represent code points as bytes. These are called encodings.
UTF-8
UTF-8 is easily the most popular encoding for storage and transmission of Unicode. It uses a variable number of bytes for each code point. ASCII characters in particular are one byte each, using the same values as ASCII, so ASCII is a subset of UTF-8.
Here we show our exotic string as UTF-8. The ASCII characters H and I are single bytes, other characters use two or three bytes depending on their code point value. Some code points require four bytes, though we aren't using any of those here.
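The variable width can be checked directly. This sketch (Python 3 syntax) uses a stand-in string, which is an assumption since the slide itself is not reproduced here, and counts the UTF-8 bytes each character needs.

```python
# Stand-in for the exotic string on the slide (an assumption, not the original).
text = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

# Pair each character with the number of UTF-8 bytes it encodes to.
widths = [(ch, len(ch.encode("utf-8"))) for ch in text]
for ch, n in widths:
    print("U+%04X -> %d byte(s)" % (ord(ch), n))

# A code point above U+FFFF needs four bytes.
four = len("\U0001F4A9".encode("utf-8"))
print(four)
```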

Python 2

OK, enough theory, let's talk about Python 2.
Str vs Unicode
In Python 2, there are two different string data types. A plain-old string literal gives you a "str" object, which stores bytes. If you use a "u" prefix, you get a "unicode" object, which stores code points. In a unicode string literal, you can use backslash-u to insert any Unicode code point.
Notice that the word "string" is problematic. Both "str" and "unicode" are kinds of strings, and it's tempting to call either or both of them "string," but better to use more specific terms to keep things straight.
.encode() and .decode()
To convert between bytes and unicode, each has a method. Unicode strings have a .encode() method that produces bytes, and byte strings have a .decode() method that produces unicode. Each takes an argument, which is the name of the encoding to use for the operation.
We can define a Unicode string named my_unicode, and see that it has 9 characters. We can encode it to UTF-8 to create the my_utf8 byte string, which has 19 bytes. As you'd expect, decoding the UTF-8 string produces the original Unicode string.
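A minimal sketch of that round trip, written in Python 3 syntax (in Python 2 the literal would carry a `u` prefix). The exact string is an assumption, chosen so that it matches the 9 characters and 19 bytes described above.

```python
# Assumed contents for my_unicode; chosen to match the counts in the text.
my_unicode = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

my_utf8 = my_unicode.encode("utf-8")    # text -> bytes
roundtrip = my_utf8.decode("utf-8")     # bytes -> text

print(len(my_unicode), len(my_utf8))
print(roundtrip == my_unicode)
```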
Encoding errors
Unfortunately, encoding and decoding can produce errors if the data isn't appropriate for the specified encoding. Here we try to encode our exotic Unicode string to ASCII. It fails because ASCII can only represent characters in the range 0 to 127, and our Unicode string has code points well outside that range.
The UnicodeEncodeError that's raised indicates the encoding being used, in the form of the "codec", for coder/decoder, and the actual position of the character that caused the problem.
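The details carried by the exception can be inspected in code. This is a sketch in Python 3 syntax with an assumed stand-in string; `encoding`, `start`, and `object` are attributes of the standard `UnicodeEncodeError`.

```python
# Assumed stand-in for the exotic string.
my_unicode = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

error = None
try:
    my_unicode.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    # The codec name and the position of the first offending character.
    print(exc.encoding, exc.start, "U+%04X" % ord(exc.object[exc.start]))
```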
Decoding errors
Decoding can also produce errors. Here we try to decode our UTF-8 string as ASCII and get a UnicodeDecodeError because again, ASCII can only accept values up to 127, and our UTF-8 string has bytes outside that range.
Even UTF-8 can't decode any sequence of bytes. Here we try to decode some random junk, and it also produces a UnicodeDecodeError. Actually, one of UTF-8's advantages is that there are invalid sequences of bytes, which helps to build robust systems: mistakes in data won't be accepted as if they were valid.
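Both failure cases can be reproduced with a small helper. The sketch below is Python 3 syntax; the `try_decode` helper, the stand-in string, and the junk bytes are all illustrative assumptions.

```python
# Assumed stand-in data: valid UTF-8, and some bytes that are not valid UTF-8.
my_utf8 = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24".encode("utf-8")
junk = b"\x78\x9a\xbc\xde\xf0"

def try_decode(data, encoding):
    """Decode data, or report where decoding failed (hypothetical helper)."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed at byte offset %d" % exc.start

print(try_decode(my_utf8, "utf-8"))   # decodes cleanly
print(try_decode(my_utf8, "ascii"))   # fails on the first non-ASCII byte
print(try_decode(junk, "utf-8"))      # fails: not a valid UTF-8 sequence
```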
Error handling
When encoding or decoding, you can specify what should happen when the codec can't handle the data. An optional second argument to encode or decode specifies the policy. The default value is "strict", which means raise an error, as we've seen.
A value of "replace" means, give me a standard replacement character. When encoding, the replacement character is a question mark, so any code point that can't be encoded using the specified encoding will simply produce a "?".
Other error handlers are more useful. "xmlcharrefreplace" produces an HTML/XML character entity reference, so that \u01B4 becomes "&#436;" (hex 01B4 is decimal 436). This is very useful if you need to output unicode for an HTML file.
Notice that different error policies are used for different reasons. "Replace" is a defensive mechanism against data that cannot be interpreted, and loses information. "Xmlcharrefreplace" preserves all the original information, and is used when outputting data where XML escapes are acceptable.
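A short sketch of the two encoding policies discussed above, in Python 3 syntax and with an assumed stand-in string. The second argument is the standard `errors` parameter of `str.encode`.

```python
# Assumed stand-in for the exotic string.
my_unicode = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"

# "replace": each unencodable code point becomes a question mark (lossy).
replaced = my_unicode.encode("ascii", "replace")

# "xmlcharrefreplace": each unencodable code point becomes a numeric entity.
xml_refs = my_unicode.encode("ascii", "xmlcharrefreplace")

print(replaced)
print(xml_refs)
```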
Error handling
You can also specify error handling when decoding. "Ignore" will drop bytes that can't decode properly. "Replace" will insert a Unicode U+FFFD, "REPLACEMENT CHARACTER" for problem bytes. Notice that since the decoder can't decode the data, it doesn't know how many Unicode characters were intended. Decoding our UTF-8 bytes as ASCII produces 16 replacement characters, one for each byte that couldn't be decoded, while those bytes were meant to only produce 6 Unicode characters.
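The decoding policies can be sketched the same way (Python 3 syntax, assumed stand-in data). Counting the replacement characters shows the mismatch between bytes that failed and characters that were intended.

```python
# Assumed stand-in: 9 characters, 6 of them non-ASCII, encoded as UTF-8.
my_utf8 = "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24".encode("utf-8")

ignored = my_utf8.decode("ascii", "ignore")     # undecodable bytes are dropped
replaced = my_utf8.decode("ascii", "replace")   # each becomes U+FFFD

replacement_count = replaced.count("\ufffd")
print(repr(ignored), replacement_count)
```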
Implicit conversion
Python 2 tries to be helpful when working with unicode and byte strings. If you try to perform a string operation that combines a unicode string with a byte string, Python 2 will automatically decode the byte string to produce a second unicode string, then will complete the operation with the two unicode strings.
For example, we try to concatenate a unicode "Hello " with a byte string "world". The result is a unicode "Hello world". On our behalf, Python 2 is decoding the byte string "world" using the ASCII codec. The encoding used for these implicit decodings is the value of sys.getdefaultencoding().
The implicit encoding is ASCII because it's the only safe guess: ASCII is so widely accepted, and is a subset of so many encodings, that it's unlikely to be wrong.
Implicit decoding errors
Of course, these implicit decodings are not immune to decoding errors. If you try to combine a byte string with a unicode string and the byte string can't be decoded as ASCII, then the operation will raise a UnicodeDecodeError.
This is the source of those painful UnicodeErrors. Your code mixes unicode strings and byte strings, and as long as the data is all ASCII, the implicit conversions silently succeed. Once a non-ASCII character finds its way into your program, an implicit decode will fail, causing a UnicodeDecodeError.
Python 2 is “helpful”
Python 2's philosophy was that unicode strings and byte strings are confusing, and it tried to ease your burden by automatically converting between them, just as it does for ints and floats. But the conversion from int to float can't fail, while byte string to unicode string can.
Python 2 silently glosses over byte to unicode conversions, making it much easier to write code that deals with ASCII. The price you pay is that it will fail with non-ASCII data.
Other implicit conversions
There are lots of ways to combine two strings, and all of them will decode bytes to unicode, so you have to watch out for them.
First we use an ASCII format string, with unicode data. The format string will be decoded to unicode, then the formatting performed, resulting in a unicode string.
Next we switch the two: A unicode format string and a byte string again combine to produce a unicode string, because the byte string data is decoded as ASCII.
Simply attempting to print a unicode string will cause an implicit encoding: output is always bytes, so the unicode string has to be encoded into bytes before it can be printed.
The next one is truly confusing: we ask to encode a byte string to UTF-8, and get an error about not being able to decode as ASCII! The problem here is that byte strings can't be encoded: remember encode is how you turn unicode into bytes. So to perform the encoding you want, Python 2 needs a unicode string, which it tries to get by implicitly decoding your bytes as ASCII.
Lastly, we encode an ASCII string to UTF-8. Here we're performing the same implicit decode to get a unicode string we can encode, but since the string is ASCII, it succeeds, and then goes on to encode it as UTF-8, producing the original byte string, since ASCII is a subset of UTF-8.
Bytes and Unicode
This is the most important Fact of Life: bytes and unicode are both important, and you need to deal with both of them. You can't pretend that everything is bytes, or everything is unicode. You need to use each for their purpose, and explicitly convert between them as needed.

Python 3

We've seen the source of Unicode pain in Python 2, now let's take a look at Python 3. The biggest change from Python 2 to Python 3 is their treatment of Unicode.
Str vs bytes
Just as in Python 2, Python 3 has two string types, one for unicode and one for bytes, but they are named differently.
Now the "str" type that you get from a plain string literal stores unicode, and the "bytes" type stores bytes. You can create a bytes literal with a b prefix.
So "str" in Python 2 is now called "bytes," and "unicode" in Python 2 is now called "str". This makes more sense than the Python 2 names, since Unicode is how you want all text stored, and byte strings are only for when you are dealing with bytes.
No coercion!
The biggest change in the Unicode support in Python 3 is that there is no automatic decoding of byte strings. If you try to combine a byte string with a unicode string, you will get an error all the time, regardless of the data involved!
All of those operations I showed where Python 2 silently converted byte strings to unicode strings to complete an operation, every one of them is an error in Python 3.
In addition, Python 2 considers a Unicode string and a bytes string equal if they contain the same ASCII bytes, and Python 3 won't. A consequence of this is that Unicode dictionary keys can't be found with byte strings, and vice-versa, as they can be in Python 2.
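The following Python 3 sketch shows these behaviors side by side: mixing raises a TypeError, equality is False, and a bytes key does not find a str key. The variable names are illustrative.

```python
# Mixing str and bytes fails in Python 3, even for pure ASCII data.
try:
    "Hello " + b"world"
    mixing_worked = True
except TypeError:
    mixing_worked = False

# str and bytes never compare equal, so dictionary lookups don't cross over.
same = ("Hello" == b"Hello")
table = {"key": 1}
found_with_bytes = b"key" in table

# The fix is an explicit decode before combining.
explicit = "Hello " + b"world".decode("ascii")

print(mixing_worked, same, found_with_bytes, explicit)
```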
Python 3 pain
This drastically changes the nature of Unicode pain in Python 3. In Python 2, mixing Unicode and bytes succeeds so long as you only use ASCII data. In Python 3, it fails immediately regardless of the data.
So Python 2's pain is deferred: you think your program is correct, and find out later that it fails with exotic characters.
With Python 3, your code fails right off the bat, so even if you are only dealing with ASCII, you have to explicitly deal with the difference between bytes and Unicode.
Python 3 is strict about the difference between bytes and unicode. You are forced to be clear in your code which you are dealing with. This has been controversial.
Reading files
One of the changes in Python 3 because of this new strictness is how files are read. Python has always had two modes for reading files: binary and text. In Python 2, it only affected the line endings, and on Unix platforms, even that was a no-op.
In Python 3, the two modes produce different results. When you open a file in text mode, either with "r", or by defaulting the mode entirely, the data read from the file is implicitly decoded into Unicode, and you get str objects.
If you open a file in binary mode, by supplying "rb" as the mode, then the data read from the file is bytes, with no processing done on them.
The implicit conversion from bytes to unicode uses the encoding returned from locale.getpreferredencoding(), and it may not give you the results you expect. For example, when we read hi_utf8.txt, it's being decoded using the locale's preferred encoding, which, since I created these samples on Windows, is "cp1252". Like ISO 8859-1, CP-1252 is a one-byte character code that accepts nearly any byte value (Python's cp1252 codec leaves only a handful of byte values undefined), so it will rarely raise a UnicodeDecodeError. That also means that it will happily decode data that isn't actually CP-1252, and produce garbage.
To get the file read properly, you should specify an encoding to use. The open() function now has an optional encoding parameter.
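Here is a self-contained Python 3 sketch of the three ways of reading the same file. The temporary file and its contents are assumptions made for the example; `encoding` is the real keyword parameter of `open()`.

```python
import os
import tempfile

# Write a small UTF-8 file to read back (illustrative content: "cafe" with an accent).
payload = "caf\xe9"
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(payload.encode("utf-8"))

with open(path, "rb") as f:                      # binary mode: raw bytes
    raw = f.read()
with open(path, "r", encoding="utf-8") as f:     # text mode, correct encoding
    good = f.read()
with open(path, "r", encoding="cp1252") as f:    # text mode, wrong encoding
    garbled = f.read()

os.remove(path)
print(raw, good, garbled)
```

Passing the encoding explicitly keeps the result independent of the machine's locale.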

Pain relief

OK, so how do we deal with all this pain? The good news is that the rules to remember are simple, and they're the same for Python 2 or 3.
Pro tip #1: Unicode sandwich
As we saw with Fact of Life #1, the data coming into and going out of your program must be bytes. But you don't need to deal with bytes on the inside of your program. The best strategy is to decode incoming bytes as soon as possible, producing unicode. You use Unicode throughout your program, and then when outputting data, encode it to bytes as late as possible.
This creates a Unicode sandwich: bytes on the outside, Unicode on the inside.
Keep in mind that sometimes, a library you're using may do some of these conversions for you. The library may present you with Unicode input, or will accept Unicode for output, and the library will take care of the edge conversion to and from bytes. For example, Django provides Unicode, as does the json module.
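A minimal sketch of the sandwich in Python 3 syntax. The `shout` function and its upper-casing step are hypothetical stand-ins for whatever your program really does; the shape (decode early, work in Unicode, encode late) is the point.

```python
def shout(raw_in, encoding="utf-8"):
    """Hypothetical edge function: bytes in, bytes out, Unicode in between."""
    text = raw_in.decode(encoding)     # decode as early as possible
    result = text.upper()              # all real work happens on Unicode
    return result.encode(encoding)     # encode as late as possible

incoming = "caf\xe9".encode("utf-8")
outgoing = shout(incoming)
print(outgoing.decode("utf-8"))
```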
Pro tip #2: Know what you have
The second rule is, you have to know what kind of data you are dealing with. At any point in your program, you need to know whether you have a byte string or a unicode string. This shouldn't be a matter of guessing, it should be by design.
In addition, if you have a byte string, you should know what encoding it is if you ever intend to deal with it as text.
When debugging your code, you can't simply print a value to see what it is. You need to look at the type, and you may need to look at the repr of the value in order to get to the bottom of what data you have.
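When debugging, a tiny helper that reports both the type and the repr makes the difference visible. The sketch below is Python 3 syntax and `describe` is a hypothetical name.

```python
def describe(value):
    """Return the type name and repr of a value (hypothetical debugging helper)."""
    return "%s %r" % (type(value).__name__, value)

text = "caf\xe9"
data = text.encode("utf-8")

print(describe(text))   # a str holding 4 characters
print(describe(data))   # a bytes object holding 5 bytes
```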
Encoding is out-of-band
I said you have to understand what encoding your byte strings are. Here's Fact of Life #4: You can't determine the encoding of a byte string by examining it. You need to know through other means. For example, many protocols include ways to specify the encoding. Here we have examples from HTTP, HTML, XML, and Python source files. You may also know the encoding by prior arrangement, for example, the spec for a data source may specify the encoding.
There are ways to guess at the encoding of the bytes, but they are just guesses. The only way to be sure of the encoding is to find it out some other way.
Poo happens
Here's an example of our exotic Unicode string, encoded as UTF-8, and then mistakenly decoded in a variety of encodings. As you can see, decoding with an incorrect encoding might succeed, but produce the wrong characters. Your program can't tell it's decoding wrong; only when people try to read the text will you know something has gone wrong.
This is a good demonstration of Fact of Life #4: the same stream of bytes is decodable using a number of different encodings. The bytes themselves don't indicate what encoding they use.
BTW, there's a term for this garbage display, from the Japanese who have been dealing with this for years and years: Mojibake.
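Mojibake is easy to reproduce. This Python 3 sketch decodes one UTF-8 byte string three ways; the sample text is a short stand-in (an assumption), not the string from the slide.

```python
# UTF-8 bytes for a short stand-in string: accented "cafe" plus a snowman.
data = "caf\xe9 \u2603".encode("utf-8")

# Every one of these decodes without error; only the first is right.
decodings = {enc: data.decode(enc) for enc in ("utf-8", "latin-1", "cp1252")}

for enc, text in decodings.items():
    print("%-8s %r" % (enc, text))
```

None of the wrong decodings raises an exception, which is why the program itself cannot notice the mistake.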
Data is dirty
Unfortunately, because the encoding for bytes has to be communicated separately from the bytes themselves, sometimes the specified encoding is wrong. For example, you may pull an HTML page from a web server, and the HTTP header claims the page is 8859-1, but in fact it is encoded with UTF-8.
In some cases, the encoding mismatch will succeed and cause mojibake. Other times, the encoding is invalid for the bytes, and will cause a UnicodeError.
Pro tip #3: Test Unicode
It should go without saying, but you should explicitly test your Unicode support. To do this, you need challenging Unicode data to pump through your code. If you are an English-only speaker, you may have a problem doing this, because lots of Unicode data is hard to read. Luckily, the variety of Unicode code points means you can construct complex Unicode strings that are still readable by English speakers.
Here's an example of overly-accented text, readable pseudo-ASCII text, and upside-down text. One good source of these sorts of strings is the various web sites that offer strings like this for teenagers to paste into social networking sites.
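One way to put such strings to work is a small table of exotic samples that every text-handling function gets checked against. This is a sketch in Python 3 syntax: the sample strings are illustrative and `save_and_load` is a hypothetical stand-in for the code under test.

```python
# Illustrative exotic samples: pseudo-ASCII, overly-accented "Test", and an astral symbol.
EXOTIC = [
    "Hi \u2119\u01b4\u2602\u210c\xf8\u1f24",
    "\u0166\u1e17\u015f\u0167",
    "\U0001F4A9",
]

def save_and_load(text):
    """Hypothetical code under test: round-trips text through UTF-8 bytes."""
    return text.encode("utf-8").decode("utf-8")

def check_round_trips():
    """Return the samples that did not survive the round trip."""
    return [s for s in EXOTIC if save_and_load(s) != s]

failures = check_round_trips()
print("failures:", failures)
```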
More Unicode
Depending on your application, you may need to dig deeper into the other complexities in the Unicode world. There are many details that I haven't covered here, and they can be very involved. I call this Fact of Life #5½ because you may not have to deal with any of this.
Facts of Life
To review, these are the five unavoidable Facts of Life:
  1. All input and output of your program is bytes.
  2. The world needs more than 256 symbols to communicate text.
  3. Your program has to deal with both bytes and Unicode.
  4. A stream of bytes can't tell you its encoding.
  5. Encoding specifications can be wrong.
Pro tips
These are the three Pro Tips to keep in mind as you build your software to keep your code Unicode-clean:
  1. Unicode sandwich: keep all text in your program as Unicode, and convert as close to the edges as possible.
  2. Know what your strings are: you should be able to explain which of your strings are Unicode, which are bytes, and for your byte strings, what encoding they use.
  3. Test your Unicode support. Use exotic strings throughout your test suites to be sure you're covering all the cases.
If you follow these tips, you'll write good solid code that deals well with Unicode, and won't fall over no matter how wild the Unicode it encounters.
See also
Other resources you might find helpful:
Joel Spolsky wrote The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), which covers how Unicode works and why. It has no Python-specific information, but is better written than this talk!
If you need to deal with the semantics of arbitrary Unicode characters, the unicodedata module in the Python standard library has functions that can help.
For testing Unicode, the various "fancy" text generators for use on social networks work great.
Thank You

Saturday, December 1, 2012

Configuring pyodbc to connect to sqlite3 on Mac OS X

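Python actually ships with a library for connecting to sqlite3 directly, but the product I am developing also needs to connect to SQL Server, so all database access has to go through pyodbc.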
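For comparison, this is roughly what the direct route looks like with the standard library's `sqlite3` module (Python 3 syntax). It is only a sketch: it uses an in-memory database rather than a file, and the table mirrors the small test table used later in this post.

```python
import sqlite3

# In-memory database for illustration; a file path such as "test.db" also works.
conn = sqlite3.connect(":memory:")
conn.execute("create table test(id int, name varchar(50))")
conn.execute("insert into test(id, name) values (?, ?)", (1, "myname"))

rows = conn.execute("select * from test").fetchall()
conn.close()
print(rows)
```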

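Configuring ODBC on Mac OS X is much the same as on Linux and, overall, a lot more trouble than on Windows.
First, what you need:
1. pyodbc is only a Python library that lets Python call ODBC; the actual ODBC connection needs an ODBC driver manager, the common ones being unixODBC and iODBC. Because Mac OS X has the graphical ODBC Administrator, I simply installed that. Note that earlier OS versions apparently bundled this tool, but my system is 10.8, so I had to download and install it from Apple's website: http://support.apple.com/kb/DL895
2. A stock sqlite3 installation only supports basic programmatic access and does not provide ODBC connectivity, so you need to download an ODBC driver separately from http://www.ch-werner.de/sqliteodbc/ Several versions are available there; choose according to your needs. Since support only goes up to 10.7, the version I used is sqlite3-odbc-0.93.dmg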

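pyodbc can be installed directly with easy_install or pip; an error I ran into during installation is described in step 3 below.
Here is how I installed everything: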
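1. Install setuptools, available from http://pypi.python.org/pypi/setuptools
2. Install pip: easy_install pip
3. Install pyodbc: pip install pyodbc
     While installing pyodbc with pip I hit a "file or directory not found" error for clang. This happens because pip needs clang to compile the extension. clang used to be bundled with Xcode, but in newer versions you have to install it from within Xcode yourself: Xcode -> Preferences -> Downloads, then install Command Line Tools under Components.
4. Download and install sqlite3-odbc by simply running the installer. When it finishes, it creates a file libsqlite3odbc-0.93.dylib under /usr/lib/; the file name may differ slightly with the version. Be sure to remember that the prefix is libsqlite3odbc: following reference article 1, I kept configuring with the file /usr/local/lib/libsqlite3.dylib, which sent me down a lot of wrong paths.
5. Download and install ODBC Administrator, then open it from the Applications folder to configure it.
    Configuration:
    1) Configure the driver: switch to the Drivers tab and click Add to add a new driver. Enter a driver name under Description, e.g. sqliteodbc, and enter /usr/lib/libsqlite3odbc-0.93.dylib for both Driver and Setup File (the file name may differ by version). Click OK to save.
    2) In a terminal, create a database with sqlite3 test.db; assume its path is /tmp/test.db
    3) Back in ODBC Administrator, click Add under User DSN and select the sqliteodbc driver created in step 1). In DSN, enter the name to use for the ODBC connection, e.g. testdb, and fill in Description as needed. Then click the Add button at the lower left to add a Key/Value pair: enter Database as the Key and the path to the db file as the Value, which in this example is /tmp/test.db. Click OK to save, then click Apply. Once this is done, you can see two files in the ~/Library/ODBC directory from a terminal: odbc.ini and odbcinst.ini. These two files hold the ODBC settings we just configured.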

6. ODBC is now configured, so we can verify the connection.
First, prepare some data for sqlite3:

$ sqlite3 test.db
SQLite version 3.7.12 2012-04-03 19:43:07
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .databases
seq  name             file
---  ---------------  ----------------------------------------------------------
0    main             /Users/trams/test.db
sqlite> .tables
sqlite> create table test(id int, name varchar(50));
sqlite> insert into test(id, name) values(1, 'myname');
sqlite> select * from test;
1|myname
sqlite> .mode columns
sqlite> .header on
sqlite> select * from test;
id          name
----------  ----------
1           myname
sqlite>


Now test the connection from ipython (ipython can be installed with pip: pip install ipython)
In [4]: import pyodbc
In [5]: conn = pyodbc.connect("DSN=testdb")
In [6]: cursor = conn.cursor()
In [7]: print cursor
<pyodbc.Cursor object at 0x10df0bf30>
In [8]: result = cursor.execute("select * from test")
In [9]: for i in result.fetchall():
....:     print i
....:
(1, 'myname')
In [10]:


OK! There is the result. I hope this write-up is helpful.


References:
1. Connecting to sqlite database via ODBC on OS X: http://www.islandjohn.com/2009/02/13/connecting-to-an-sqlite-database-via-odbc-on-os-x/
2. sqlite3 ODBC configuration: http://www.ch-werner.de/sqliteodbc/html/index.html
3. Pyodbc connect to sqlite odbc: http://billyjin.kodingen.com/punbb-1.3.4/viewtopic.php?id=377